Abstract. This paper studies the optimization of observation channels (stochastic kernels) in
partially observed stochastic control problems. In particular, existence and continuity properties 
are investigated mostly (but not exclusively) concentrating on the single-stage case. Continuity 
properties of the optimal cost in channels are explored under total variation, setwise convergence, 
and weak convergence. Sufficient conditions for compactness of a class of channels under total 
variation and setwise convergence are presented and applications to quantization are explored.

1. Introduction. In stochastic control, one is often concerned with the following
problem: Given a dynamical system, an observation channel (stochastic kernel), a cost 
function, and an action set, when does there exist an optimal policy, and what is an 
optimal control policy? The theory for such problems is advanced, and practically 
significant, spanning a wide variety of applications in engineering, economics, and 
natural sciences. 

In this paper, we are interested in a dual problem with the following questions to 
be explored: Given a dynamical system, a cost function, an action set, and a set of 
observation channels, does there exist an optimal observation channel? What is the 
right convergence notion for continuity in such observation channels for optimization 
purposes? The answers to these questions may provide useful tools for characterizing 
an optimal observation channel subject to constraints. 

We start with the probabilistic setup of the problem. Let $\mathbb{X} \subset \mathbb{R}^n$ be a Borel
set in which elements of a controlled Markov process $\{X_t,\ t \in \mathbb{Z}_+\}$ live. Here and
throughout the paper $\mathbb{Z}_+$ denotes the set of nonnegative integers and $\mathbb{N}$ denotes the
set of positive integers. Let $\mathbb{Y} \subset \mathbb{R}^m$ be a Borel set, and let an observation channel $Q$
be defined as a stochastic kernel (regular conditional probability) from $\mathbb{X}$ to $\mathbb{Y}$, such
that $Q(\,\cdot\,|x)$ is a probability measure on the (Borel) $\sigma$-algebra $\mathcal{B}(\mathbb{Y})$ on $\mathbb{Y}$ for every
$x \in \mathbb{X}$, and $Q(A|\,\cdot\,) : \mathbb{X} \to [0,1]$ is a Borel measurable function for every $A \in \mathcal{B}(\mathbb{Y})$.
Let a decision maker (DM) be located at the output of an observation channel $Q$, with
inputs $X_t$ and outputs $Y_t$. Let $\mathbb{U}$ be a Borel subset of some Euclidean space. An
admissible policy $\Pi$ is a sequence of control functions $\{\gamma_t,\ t \in \mathbb{Z}_+\}$ such that $\gamma_t$ is
measurable with respect to the $\sigma$-algebra generated by the information variables

$$I_t = \{Y_{[0,t]},\, U_{[0,t-1]}\}, \quad t \in \mathbb{N}, \qquad I_0 = \{Y_0\},$$

where

$$U_t = \gamma_t(I_t), \quad t \in \mathbb{Z}_+ \tag{1.1}$$

are the $\mathbb{U}$-valued control actions and we used the notation

$$Y_{[0,t]} = \{Y_s,\ 0 \le s \le t\}, \qquad U_{[0,t-1]} = \{U_s,\ 0 \le s \le t-1\}.$$

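The joint distribution of the state, control, and observation processes is determined
by (1.1) and the following relationships:

$$\Pr\bigl((X_0,Y_0) \in B\bigr) = \int_B P(dx_0)\,Q(dy_0|x_0), \quad B \in \mathcal{B}(\mathbb{X} \times \mathbb{Y}),$$

where $P$ is the (prior) distribution of the initial state $X_0$, and

$$\Pr\bigl((X_t,Y_t) \in B \,\big|\, X_{[0,t-1]} = x_{[0,t-1]},\ Y_{[0,t-1]} = y_{[0,t-1]},\ U_{[0,t-1]} = u_{[0,t-1]}\bigr)
= \int_B P(dx_t|x_{t-1},u_{t-1})\,Q(dy_t|x_t), \quad B \in \mathcal{B}(\mathbb{X} \times \mathbb{Y}),\ t \in \mathbb{N},$$

where $P(\,\cdot\,|x,u)$ is a stochastic kernel from $\mathbb{X} \times \mathbb{U}$ to $\mathbb{X}$.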

One way of presenting the problem in a familiar setting is the following: Consider 
a dynamical system described by the discrete-time equations 

$$X_{t+1} = f(X_t, U_t, W_t),$$
$$Y_t = g(X_t, V_t)$$

for some measurable functions $f$, $g$, with $\{W_t\}$ being an independent and identically
distributed (i.i.d.) system noise process and $\{V_t\}$ an i.i.d. disturbance process, which are
independent of $X_0$ and each other. Here, the second equation represents the
communication channel $Q$, as it describes the relation between the state and observation
variables.

With the above setup, let the objective of the decision maker be the minimization
of the cost

$$J(P,Q,\Pi) = E_P^{Q,\Pi}\Bigl[\sum_{t=0}^{T-1} c(X_t,U_t)\Bigr] \tag{1.2}$$

over the set of all admissible policies $\Pi$, where $c : \mathbb{X} \times \mathbb{U} \to \mathbb{R}$ is a Borel measurable
cost function and $E_P^{Q,\Pi}$ denotes the expectation with initial state probability measure
given by $P$ under policy $\Pi$ and given channel $Q$. We adopt the convention that
random variables are denoted by capital letters and lowercase letters denote their
realizations. Also, given a probability measure $\mu$ the notation $Z \sim \mu$ means that $Z$
is a random variable with distribution $\mu$. Finally, let $\boldsymbol{\Pi}$ be the set of all admissible
policies $\Pi$ described above.

We are interested in the following problems: 
Problem P1: Continuity on the space of channels (stochastic kernels).
Suppose $\{Q_n,\ n \in \mathbb{N}\}$ is a sequence of communication channels converging in some
sense to a channel $Q$. When does

$$Q_n \to Q$$

imply

$$\inf_{\Pi \in \boldsymbol{\Pi}} J(P,Q_n,\Pi) \to \inf_{\Pi \in \boldsymbol{\Pi}} J(P,Q,\Pi)?$$

Problem P2: Existence of optimal channels. Let $\mathcal{Q}$ be a set of communication
channels. When do there exist minimizing and maximizing channels for the problems

$$\inf_{Q \in \mathcal{Q}} \inf_{\Pi \in \boldsymbol{\Pi}} E_P^{Q,\Pi}\Bigl[\sum_{t=0}^{T-1} c(X_t,U_t)\Bigr]$$

and

$$\sup_{Q \in \mathcal{Q}} \inf_{\Pi \in \boldsymbol{\Pi}} E_P^{Q,\Pi}\Bigl[\sum_{t=0}^{T-1} c(X_t,U_t)\Bigr]?$$

If solutions to these problems exist, are they unique?

Problems P1 and P2 are challenging even in the single-stage ($T = 1$) setup and
in most of the paper we consider this case. Admittedly, the multi-stage case is more
important and we briefly consider this case in Section 6 at the end of the paper. Future
work is needed to fully address this technically more complex case.

The answers to problems P1 and P2 may help solve problems in application areas
such as: 

• For a partially observed stochastic control problem, sometimes we have con- 
trol over the observation channels by encoding/quantization. When does 
there exist an optimal quantizer for such a setup? (Optimal quantization) 

• Given an uncertainty set for the observation channels, can one identify a
worst element/best element? (Robust control)

• When estimating channels from empirical observations, under quite general 
assumptions estimations converge to the actual distribution, in some sense. 
For example, if an observation channel has the form Y t = X t + Vt, where 
the independent noise V t has a density, nonparametric density estimation 
methods lead to convergence in total variation, whereas for the general case, 
the empirical measures converge weakly with probability one [10], [13]. Do
these modes of convergence imply that we could design the optimal control 
policies based on empirical estimates, and does the optimal cost converge 
to the correct limit as the number of measurements grows? (Consistency of 
empirical controllers) 

In the following, we will address problems P1 and P2 and introduce conditions
under which we can provide affirmative/conclusive answers. 

1.1. Relevant literature. The problems stated are related to three main areas
of research: robust control, optimal quantizer design, and design of experiments.

References [5] [23 12H] have considered both optimal control and estimation and 
the related problem of optimal control design when the channel is unknown. In 
particular, [28] studies the existence of optimal continuous estimation policies and
worst-case channels under a relative entropy constraint characterizing the uncertainty
in the system. In [26], the total variation norm is considered as the measure of the
uncertainty, and the inf-sup policy is determined (thus, the setup is considered as a
min-max problem for the generation of optimal control policies). Similarly, there are
connections with robust detection, such as those studied by Huber [3T] and Poor [33] , 
when the source distribution to be detected belongs to some set. 

A related area is the theory of optimal quantization: References [1], [16] are
related as these papers study the effects of uncertainties in the input distribution 
and consider robustness in the quantizer design. References [23] and [23] study the
consistency of optimal quantizers based on empirical data for an unknown source. In
the context of decentralized detection, [35] studied certain topological properties and
the existence of optimal quantizers. We will regard the quantizers as a particular class
of channels, and look for such optimal channels. One by-product of our analysis will be
a new approach to obtain conditions for the existence of optimal quantizers for a given
class of cost functions under mild conditions. We also note that, regarding connections
with information theory, some discussions on the topology of information channels are
presented in [33]. Recently, [31] considered continuity and other functional properties
of minimum mean square estimation problems under Gaussian channels.

As mentioned earlier, in most of the paper we consider the single-stage case. We
will also briefly consider the technically more complex multi-stage case in Section 6,
where further conditions on the controlled Markov chain must be imposed. The full
development of this general setup is the subject of future work.

The rest of the paper is organized as follows. In the next section, we introduce
three relevant topologies on the space of communication channels. The continuity
problem is considered in Section 3. We study the problem of existence of optimal
channels in Section 4, followed by applications on quantization in Section 5. Section 6
gives an outlook to the multi-stage setup. The paper ends with concluding remarks
and discussions in the final section.

2. Some topologies on the space of communication channels. One question
that we wish to address is the choice of an appropriate notion of convergence for a
sequence of observation channels. Toward this end, we first review three notions of
convergence for probability measures.

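Let $\mathcal{P}(\mathbb{R}^N)$ denote the family of all probability measures on $(\mathbb{R}^N, \mathcal{B}(\mathbb{R}^N))$ for some
$N \in \mathbb{N}$. Let $\{\mu_n,\ n \in \mathbb{N}\}$ be a sequence in $\mathcal{P}(\mathbb{R}^N)$. Recall that $\{\mu_n\}$ is said to converge
to $\mu \in \mathcal{P}(\mathbb{R}^N)$ weakly if

$$\int_{\mathbb{R}^N} c(x)\,\mu_n(dx) \to \int_{\mathbb{R}^N} c(x)\,\mu(dx)$$

for every continuous and bounded $c : \mathbb{R}^N \to \mathbb{R}$. On the other hand, $\{\mu_n\}$ is said to
converge to $\mu \in \mathcal{P}(\mathbb{R}^N)$ setwise if

$$\int_{\mathbb{R}^N} c(x)\,\mu_n(dx) \to \int_{\mathbb{R}^N} c(x)\,\mu(dx)$$

for every measurable and bounded $c : \mathbb{R}^N \to \mathbb{R}$. Setwise convergence can also be
defined through pointwise convergence on Borel subsets of $\mathbb{R}^N$ (see, e.g., [20]), that is

$$\mu_n(A) \to \mu(A), \quad \text{for all } A \in \mathcal{B}(\mathbb{R}^N),$$

since the space of simple functions is dense in the space of bounded and measurable
functions under the supremum norm.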

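For two probability measures $\mu, \nu \in \mathcal{P}(\mathbb{R}^N)$, the total variation metric is given by

$$\|\mu - \nu\|_{TV} := 2 \sup_{B \in \mathcal{B}(\mathbb{R}^N)} |\mu(B) - \nu(B)|
= \sup_{f:\, \|f\|_\infty \le 1} \Bigl| \int f(x)\,\mu(dx) - \int f(x)\,\nu(dx) \Bigr|, \tag{2.1}$$

where the supremum is over all measurable real $f$ such that $\|f\|_\infty = \sup_{x \in \mathbb{R}^N} |f(x)| \le 1$.
A sequence $\{\mu_n\}$ is said to converge to $\mu \in \mathcal{P}(\mathbb{R}^N)$ in total variation if
$\|\mu_n - \mu\|_{TV} \to 0$.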






Setwise convergence is equivalent to pointwise convergence on Borel sets whereas
convergence in total variation requires uniform convergence on Borel sets. Thus
convergence in total variation implies setwise convergence, which in turn implies weak
convergence. It follows that the induced topologies are of decreasing order of strength:
the topology induced by convergence in total variation is the strongest, the topology
induced by weak convergence is the weakest, and the topology induced by setwise
convergence lies between these two. The topologies corresponding
to convergence in total variation and weak convergence are metrizable (the natural
metric for total variation convergence is $d(\mu,\nu) = \|\mu - \nu\|_{TV}$; the usual choice for weak
convergence is the Prohorov metric [4]). The topology induced by setwise convergence
is not first countable, so it is not metrizable (see, e.g., [21, Prop. 2.2.1]).
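The gap between these modes of convergence can be seen in a small numerical sketch. The example below is an illustration only and is not taken from the paper: it assumes the point masses $\mu_n = \delta_{1/n}$ and $\mu = \delta_0$, and the helper names `tv_distance` and `integrate` are hypothetical. The sequence converges weakly, but neither setwise nor in total variation.

```python
import math

def tv_distance(p, q):
    """Total variation distance, with the factor-2 convention of (2.1), between
    two finitely supported measures given as {point: mass} dictionaries."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def integrate(c, p):
    """Integral of the function c against the finitely supported measure p."""
    return sum(c(x) * mass for x, mass in p.items())

def indicator_positive(x):
    """Bounded measurable test function that is discontinuous at 0."""
    return 1.0 if x > 0 else 0.0

mu = {0.0: 1.0}  # assumed limit measure: point mass at 0

for n in (1, 10, 100, 1000):
    mu_n = {1.0 / n: 1.0}  # assumed sequence: point mass at 1/n
    weak_gap = abs(integrate(math.cos, mu_n) - integrate(math.cos, mu))
    setwise_gap = abs(integrate(indicator_positive, mu_n)
                      - integrate(indicator_positive, mu))
    print(n, weak_gap, setwise_gap, tv_distance(mu_n, mu))
```

For the continuous test function the gap shrinks with $n$, while the indicator of $(0,\infty)$ and the total variation distance stay bounded away from zero for every $n$.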




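2.1. Convergence of information (observation) channels. Here $\mathbb{X} = \mathbb{R}^n$
and $\mathbb{Y} = \mathbb{R}^m$, and $\mathcal{Q}$ denotes the set of all observation channels (stochastic kernels)
with input space $\mathbb{X}$ and output space $\mathbb{Y}$. For $P \in \mathcal{P}(\mathbb{X})$ and $Q \in \mathcal{Q}$ we let $PQ$
denote the joint distribution induced on $(\mathbb{X} \times \mathbb{Y}, \mathcal{B}(\mathbb{X} \times \mathbb{Y}))$ by channel $Q$ with input
distribution $P$:

$$PQ(A) = \int_A Q(dy|x)\,P(dx), \quad A \in \mathcal{B}(\mathbb{X} \times \mathbb{Y}).$$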



Definition 2.1 (Convergence of Channels). 

(i) A sequence of channels $\{Q_n\}$ converges to a channel $Q$ weakly at input $P$
if $PQ_n \to PQ$ weakly.

(ii) A sequence of channels $\{Q_n\}$ converges to a channel $Q$ setwise at input $P$
if $PQ_n \to PQ$ setwise, i.e., if $PQ_n(A) \to PQ(A)$ for all Borel sets $A \subset \mathbb{X} \times \mathbb{Y}$.

(iii) A sequence of channels $\{Q_n\}$ converges to a channel $Q$ in total variation at
input $P$ if $PQ_n \to PQ$ in total variation, i.e., if $\|PQ_n - PQ\|_{TV} \to 0$.

If we introduce the equivalence relation $Q \stackrel{P}{\equiv} Q'$ if and only if $PQ = PQ'$,
$Q, Q' \in \mathcal{Q}$, then the convergence notions in Definition 2.1 only induce the
corresponding topologies (resp. metrics) on the resulting equivalence classes in $\mathcal{Q}$, instead
of $\mathcal{Q}$. Since in most of the development the input distribution $P$ is fixed, there should
be no confusion when (somewhat incorrectly) we talk about the induced topologies
(resp. metrics) on $\mathcal{Q}$.

The preceding definition involved the input distribution P. The next lemma gives 
sufficient conditions which may be easier to verify. The proof is given in the Appendix. 

Lemma 2.2.

(i) If $\{Q_n(\,\cdot\,|x)\}$ converges to $Q(\,\cdot\,|x)$ weakly for $P$-a.e. $x$, then $PQ_n \to PQ$
weakly.

(ii) If $\{Q_n(\,\cdot\,|x)\}$ converges to $Q(\,\cdot\,|x)$ setwise for $P$-a.e. $x$, then $PQ_n \to PQ$
setwise.

(iii) If $\{Q_n(\,\cdot\,|x)\}$ converges to $Q(\,\cdot\,|x)$ in total variation for $P$-a.e. $x$, then $PQ_n \to PQ$
in total variation.

The conditions in Lemma 2.2 are almost universal in the choice of input
probability measures; that is, the convergence characterizations will be independent of the
input distributions if each of the conditions is replaced with convergence of $\{Q_n(\,\cdot\,|x)\}$
to $Q(\,\cdot\,|x)$ for all $x \in \mathbb{X}$. This is particularly useful when the input distribution
is unknown, or when the input distributions may change. The latter can occur in
multi-stage stochastic control problems.

Example 2.3. 

(i) Consider the case where the observation channel has the form $Y_t = X_t + V_t$,
where $\{V_t\}$ is an i.i.d. noise (disturbance) process. Suppose $V_t \sim f_{\theta_0}$ for some $\theta_0 \in \Theta$,
where $\Theta \subset \mathbb{R}^d$ is a parameter set and $\{f_\theta : \theta \in \Theta\}$ is a parametric family of
$n$-dimensional densities such that $f_{\theta_n}(v) \to f_{\theta_0}(v)$ for all $v \in \mathbb{R}^n$ and any sequence of
parameters $\theta_n$ such that $\theta_n \to \theta_0$. Then by Scheffé's theorem $f_{\theta_n}$ converges to $f_{\theta_0}$
in the $L_1$ sense, and consequently, the sequence of corresponding additive channels
$Q_n(\,\cdot\,|x)$, defined by

$$Q_n(A|x) = \int_A f_{\theta_n}(z - x)\,dz, \quad A \in \mathcal{B}(\mathbb{R}^n),$$

converges to the channel $Q(\,\cdot\,|x)$ (corresponding to $f_{\theta_0}$) in total variation for all $x$.

(ii) Consider again the observation channel $Y_t = X_t + V_t$, but assume this time
that we only know that $V_t$ has a density $f$ (which is unknown to us). If we have access
to independent observations $V_1, \dots, V_n$ from the noise process, then we can use any
of the consistent nonparametric methods, e.g., [10], to obtain an estimate $f_n$ which
converges (with probability one) to $f$ in the $L_1$ sense as $n \to \infty$. More explicitly,
letting $(\Omega, \mathcal{A}, \mathbb{P})$ be the probability space on which the independent observations $\{V_i\}$
are defined, for any $\omega \in \Omega$, the estimate $f_n = f_{n,\omega}$ is a pdf on $\mathbb{R}^n$, and there exists
$A \in \mathcal{A}$ with $\mathbb{P}(A) = 1$ such that $\int |f_{n,\omega}(z) - f(z)|\,dz \to 0$ as $n \to \infty$ for all $\omega \in A$.
The estimated channel $Q_n(\,\cdot\,|x) = Q_{n,\omega}(\,\cdot\,|x)$ corresponding to $f_{n,\omega}$ converges to the
true channel $Q(\,\cdot\,|x)$ in total variation for all $x$ with probability one. More explicitly,
for any $\omega \in A$, $Q_{n,\omega}(\,\cdot\,|x)$ converges to $Q(\,\cdot\,|x)$ in total variation as $n \to \infty$ for all $x$.

(iii) Now suppose that the observation channel $Q$ is such that $Q(\,\cdot\,|x)$ admits a
conditional density $f(y|x)$ for all $x \in \mathbb{R}^n$. Given observations $(X_1, Y_1), \dots, (X_n, Y_n)$
drawn independently from the distribution $PQ$, there exists a sequence of
nonparametric conditional density estimates $f_n(y|x)$ such that

$$\int \Bigl( \int |f_n(y|x) - f(y|x)|\,dy \Bigr) P(dx) \to 0$$

with probability one [17]. This immediately implies that the channels $Q_n$
corresponding to these estimates converge to $Q$ in total variation at input $P$.

(iv) Finally, assume again the additive model $Y_t = X_t + V_t$, where now we do not
have any information about the distribution $\mu$ of $V_t$. In this case there are no methods
to consistently estimate $\mu$ in total variation from independent samples $V_1, \dots, V_n$
[11]. However, the empirical distribution $\mu_n$ of the samples converges weakly to $\mu$
with probability one [13]. The corresponding estimated observation channels $Q_n(\,\cdot\,|x)$
converge weakly to the true channel $Q(\,\cdot\,|x)$ for all $x$ with probability one.
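Part (i) of the example can be checked numerically in a simple special case. The sketch below assumes, purely for illustration, a scalar Gaussian location family $f_\theta = N(\theta, 1)$ with $\theta_n = 1/n \to \theta_0 = 0$; this family, the integration grid, and the function names are assumptions and do not come from the paper. Because the channel is additive, the total variation distance between $Q_n(\,\cdot\,|x)$ and $Q(\,\cdot\,|x)$ is the same for every $x$ and equals the $L_1$ distance between the two noise densities.

```python
import math

def gauss_pdf(v, theta):
    """Density of N(theta, 1) at v (assumed parametric family f_theta)."""
    return math.exp(-0.5 * (v - theta) ** 2) / math.sqrt(2.0 * math.pi)

def l1_distance(theta_n, theta_0, lo=-12.0, hi=12.0, steps=24000):
    """Midpoint-rule approximation of the L1 distance between f_theta_n and f_theta_0."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        v = lo + (i + 0.5) * h
        total += abs(gauss_pdf(v, theta_n) - gauss_pdf(v, theta_0))
    return total * h

def l1_closed_form(theta_n, theta_0):
    """Closed form for unit-variance Gaussians: 2 * erf(|theta_n - theta_0| / (2 * sqrt(2)))."""
    return 2.0 * math.erf(abs(theta_n - theta_0) / (2.0 * math.sqrt(2.0)))

for n in (1, 10, 100, 1000):
    theta_n = 1.0 / n
    print(n, l1_distance(theta_n, 0.0), l1_closed_form(theta_n, 0.0))
```

Both columns decrease to zero as $\theta_n \to \theta_0$, in line with the conclusion that pointwise convergence of the densities yields convergence of the channels in total variation.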

2.2. Classes of assumptions. Throughout the paper the following classes of
assumptions will be adopted for the cost function $c$ and the (Borel) set $\mathbb{U} \subset \mathbb{R}^k$ in
different contexts:
Assumptions.

A1. The function $c : \mathbb{X} \times \mathbb{U} \to \mathbb{R}$ is non-negative, bounded, and continuous on
$\mathbb{X} \times \mathbb{U}$.

A2. The function $c : \mathbb{X} \times \mathbb{U} \to \mathbb{R}$ is non-negative, measurable, and
bounded.

A3. The function $c : \mathbb{X} \times \mathbb{U} \to \mathbb{R}$ is non-negative, measurable, bounded, and
continuous on $\mathbb{U}$ for every $x \in \mathbb{X}$.







A4. U is a compact set. 
A5. U is a convex set. 

3. Problem P1: Continuity of the optimal cost in channels. In this section,
we consider continuity properties under total variation, setwise convergence, and
weak convergence. We consider the single-stage case, and thus investigate the
continuity of the functional

$$J(P,Q) = \inf_{\gamma \in \mathcal{G}} E_P^{Q,\gamma}\bigl[c(X_0,U_0)\bigr]$$

in the channel $Q$, where $\mathcal{G}$ is the collection of all Borel measurable functions mapping
$\mathbb{Y}$ into $\mathbb{U}$. Note that by our previous notation, $\Pi = \gamma$ is an admissible first-stage
control policy. As before, in this section $\mathcal{Q}$ denotes the set of all channels with input
space $\mathbb{X}$ and output space $\mathbb{Y}$.

Total variation is a stringent notion for convergence. For example a sequence 
of discrete probability measures never converges in total variation to a probability 
measure which admits a density function with respect to the Lebesgue measure. On 
the other hand, setwise convergence induces a topology on the space of probability 
measures and channels which is not easy to work with. This is mainly due to the 
property that the space under this convergence is not metrizable. However, the space 
of probability measures on a complete, separable, metric (Polish) space endowed with 
the topology of weak convergence is itself a complete, separable, metric space [I]. 
The Prohorov metric, for example, can be used to metrize this space. This metric has 
found many applications in information theory and stochastic control. Furthermore, 
there are well-known conditions to identify whether a family of probability measures is 
weakly compact [4]. For these reasons, one would like to work with weak convergence. 
However, as we will observe, weak convergence is insufficient in a general setup for 
obtaining continuity. 

Before proceeding further, however, we look for conditions under which an optimal
control policy exists, i.e., when the infimum in $\inf_{\gamma} E_P^{Q,\gamma}[c(X,U)]$ is a minimum. The
following simple result is proved in the Appendix.

Theorem 3.1. Suppose assumptions A3 and A4 hold. Then, there exists an
optimal control policy for any channel $Q$.

Remark. The assumptions that $c$ is bounded and $\mathbb{U}$ is compact can be weakened
in the preceding theorem. For example, one can prove the same result by assuming
that $\mathbb{U} = \mathbb{R}^k$, $\lim_{\|u\| \to \infty} c(x,u) = \infty$ for all $x$, $c(x,u)$ is lower semi-continuous on $\mathbb{U}$
for every $x$, and there exists $u_0$ such that $\int c(x,u_0)\,P(dx) < \infty$.

3.1. Weak convergence. 

3.1.1. Absence of continuity under weak convergence. The following
counterexample demonstrates that $J(P,Q)$ may not be continuous under weak convergence
of channels even for continuous cost functions and compact $\mathbb{X}$, $\mathbb{Y}$, and $\mathbb{U}$. Note that
the absence of continuity here is also implied by a less elementary counterexample for
setwise convergence in Section 3.2.1.

Let $\mathbb{X} = \mathbb{Y} = \mathbb{U} = [a,b]$ for some $a, b \in \mathbb{R}$, $a < b$. Suppose the cost is given as
$c(x,u) = (x - u)^2$ and assume that $P$ is a discrete distribution with two atoms:

$$P = \tfrac{1}{2}\delta_a + \tfrac{1}{2}\delta_b,$$

where $\delta_a$ is the delta measure at point $a$, that is, $\delta_a(A) = 1_{\{a \in A\}}$ for every Borel set
$A$, where $1_E$ denotes the indicator function of event $E$. Let $\{Q_n\}$ be a sequence of
channels given by

$$Q_n(\,\cdot\,|x) = \begin{cases} \delta_{a + \frac{1}{n}}, & x \ge a + \frac{1}{n}, \\ \delta_a, & x < a + \frac{1}{n}. \end{cases} \tag{3.1}$$

In this case, the optimal control policy, which is unique up to changes in points
of measure zero, is

$$\gamma_n(y) = a\,1_{\{y < a + \frac{1}{n}\}} + b\,1_{\{y \ge a + \frac{1}{n}\}}, \quad n \in \mathbb{N},\ n \ge \frac{1}{b-a},$$

leading to a cost of 0. We observe that the limit of the sequence $\{Q_n(\,\cdot\,|x)\}$ is given
by

$$Q(\,\cdot\,|x) = \delta_a \quad \text{for all } x \in \mathbb{R}. \tag{3.2}$$

Thus, by Lemma 2.2, $Q_n \to Q$ weakly at input $P$. However, the limit of the sequence
of channels cannot distinguish between the inputs, since the channel output always
equals $a$. Thus, even though

$$J(P,Q_n) = 0 \quad \text{for all } n \ge \frac{1}{b-a},$$

the cost of $Q = \lim_n Q_n$ is

$$J(P,Q) = \frac{(b-a)^2}{4},$$

since, letting $(X,Y) \sim PQ$, we have $\gamma(y) = E[X|Y = y] = (b+a)/2$ for all $y$. □
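The jump in the optimal cost can be reproduced with a few lines of code. The sketch below takes $a = 0$ and $b = 1$ and assumes that, under $Q_n$, the channel outputs $a + 1/n$ when $x \ge a + 1/n$ and $a$ otherwise, with the threshold policy $\gamma_n$ described above; the function names and the grid search over constant actions are illustrative choices and not part of the paper.

```python
def channel_n(x, n, a):
    """Assumed deterministic channel Q_n: output a + 1/n if x >= a + 1/n, else a."""
    return a + 1.0 / n if x >= a + 1.0 / n else a

def policy_n(y, n, a, b):
    """Threshold policy gamma_n: decide a below a + 1/n and b otherwise."""
    return a if y < a + 1.0 / n else b

def cost_n(n, a, b):
    """Expected quadratic cost under Q_n and gamma_n for P = (delta_a + delta_b) / 2."""
    return sum(0.5 * (x - policy_n(channel_n(x, n, a), n, a, b)) ** 2 for x in (a, b))

def cost_limit(a, b, grid=1001):
    """Optimal cost under the limit channel Q = delta_a: the output carries no
    information, so a policy reduces to a constant action u, searched on a grid."""
    candidates = (a + (b - a) * i / (grid - 1) for i in range(grid))
    return min(0.5 * (a - u) ** 2 + 0.5 * (b - u) ** 2 for u in candidates)

a, b = 0.0, 1.0
print([cost_n(n, a, b) for n in (1, 2, 10, 1000)])  # zero for every n >= 1/(b - a)
print(cost_limit(a, b))                             # (b - a)^2 / 4
```

The cost along the sequence is identically zero while the cost at the weak limit is $(b-a)^2/4$, so the optimal cost is not lower semi-continuous under weak convergence in this example.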

3.1.2. Upper semi-continuity under weak convergence. 

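Theorem 3.2. Suppose assumptions A1 and A5 hold. If $\{Q_n\}$ is a sequence of
channels converging weakly at input $P$ to a channel $Q$, then

$$\limsup_{n \to \infty} J(P,Q_n) \le J(P,Q),$$

that is, $J(P,Q)$ is upper semi-continuous on $\mathcal{Q}$ under weak convergence.

Proof. Let $\mu$ be an arbitrary probability measure on $(\mathbb{X} \times \mathbb{Y}, \mathcal{B}(\mathbb{X} \times \mathbb{Y}))$ and let $\mu_{\mathbb{Y}}$
be its second marginal, i.e., $\mu_{\mathbb{Y}}(A) = \mu(\mathbb{X} \times A)$ for $A \in \mathcal{B}(\mathbb{Y})$. Let $g \in \mathcal{G}$ be arbitrary.
By Lusin's theorem [37, Thm. 2.24] there is a continuous function¹ $f : \mathbb{Y} \to \mathbb{U}$ such
that

$$\mu_{\mathbb{Y}}\{y : f(y) \ne g(y)\} < \epsilon.$$

Letting $B = \{y : f(y) \ne g(y)\}$ we obtain

$$\int \bigl|c(x,g(y)) - c(x,f(y))\bigr|\,\mu(dx,dy) = \int_{\mathbb{X} \times B} \bigl|c(x,g(y)) - c(x,f(y))\bigr|\,\mu(dx,dy) < \epsilon c^*,$$

where $c^* = \sup_{x,u} c(x,u) < \infty$ by assumption A1, so that

$$\int c(x,f(y))\,\mu(dx,dy) < \int c(x,g(y))\,\mu(dx,dy) + c^*\epsilon. \tag{3.3}$$

¹ Lusin's theorem as stated in [27] implies the statement for $\mathbb{U} = \mathbb{R}$. The extension to the case
$\mathbb{U} = \mathbb{R}^k$ is straightforward. If $\mathbb{U}$ is any closed and convex subset of $\mathbb{R}^k$, then there is a continuous
function $\pi : \mathbb{R}^k \to \mathbb{U}$ such that $\pi(u) = u$ on $\mathbb{U}$ (the metric projection onto $\mathbb{U}$). Then $\tilde f = \pi \circ f$ is the
desired continuous mapping from $\mathbb{Y}$ into $\mathbb{U}$.

Let $\mathcal{C}$ be the set of continuous functions from $\mathbb{Y}$ into $\mathbb{U}$, define

$$j(\mu,\mathcal{C}) = \inf_{\gamma \in \mathcal{C}} \int c(x,\gamma(y))\,\mu(dx,dy), \qquad j(\mu,\mathcal{G}) = \inf_{\gamma \in \mathcal{G}} \int c(x,\gamma(y))\,\mu(dx,dy),$$

and note that $j(\mu,\mathcal{C}) \ge j(\mu,\mathcal{G})$ since $\mathcal{C} \subset \mathcal{G}$. By (3.3), $j(\mu,\mathcal{C})$ is upper bounded
by the right-hand side of (3.3). Since $g$ in (3.3) was arbitrary, we obtain $j(\mu,\mathcal{C}) \le
j(\mu,\mathcal{G}) + c^*\epsilon$, which in turn implies $j(\mu,\mathcal{C}) \le j(\mu,\mathcal{G})$ since $\epsilon > 0$ was arbitrary. Hence

$$j(\mu,\mathcal{C}) = j(\mu,\mathcal{G}).$$

Applying the above first to $PQ_n$ and then to $PQ$, we obtain

$$\begin{aligned}
\limsup_{n \to \infty} \inf_{\gamma \in \mathcal{G}} \int c(x,\gamma(y))\,PQ_n(dx,dy)
&= \limsup_{n \to \infty} \inf_{f \in \mathcal{C}} \int c(x,f(y))\,PQ_n(dx,dy) \\
&\le \inf_{f \in \mathcal{C}} \limsup_{n \to \infty} \int c(x,f(y))\,PQ_n(dx,dy) \\
&= \inf_{f \in \mathcal{C}} \int c(x,f(y))\,PQ(dx,dy) \\
&= \inf_{\gamma \in \mathcal{G}} \int c(x,\gamma(y))\,PQ(dx,dy),
\end{aligned}$$

where the next to last equality holds since $PQ_n$ converges weakly to $PQ$. □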
3.2. Continuity properties under setwise convergence. 

3.2.1. Absence of continuity under setwise convergence. The following
counterexample demonstrates that $J(P,Q)$ may not be continuous under setwise
convergence of channels even for continuous cost functions and compact $\mathbb{X}$, $\mathbb{Y}$, and $\mathbb{U}$.

Let $\mathbb{X} = \mathbb{Y} = \mathbb{U} = [0,1]$. Assume that $X$ has distribution

$$P = \tfrac{1}{2}\delta_0 + \tfrac{1}{2}\delta_1.$$

Let $Q(\,\cdot\,|x) = U([0,1])$ for all $x$, so that if $(X,Y) \sim PQ$, then $Y$ is independent of $X$
and has the uniform distribution on $[0,1]$. Let $c(x,u) = (x - u)^2$.
By independence, $E[X|Y] = E[X] = 1/2$, so

$$J(P,Q) = \min_{\gamma \in \mathcal{G}} E[(X - \gamma(Y))^2] = E[(X - E[X|Y])^2]
= \tfrac{1}{2}\bigl(1 - \tfrac{1}{2}\bigr)^2 + \tfrac{1}{2}\bigl(0 - \tfrac{1}{2}\bigr)^2 = \tfrac{1}{4}.$$

For $n \in \mathbb{N}$ and $k = 1, \dots, n$ consider the intervals

$$L_{nk} = \Bigl[\frac{2k-2}{2n}, \frac{2k-1}{2n}\Bigr), \qquad R_{nk} = \Bigl[\frac{2k-1}{2n}, \frac{2k}{2n}\Bigr) \tag{3.4}$$

and define the "square wave" function

$$h_n(t) = \sum_{k=1}^{n} \bigl(1_{\{t \in L_{nk}\}} - 1_{\{t \in R_{nk}\}}\bigr).$$

Since $\int_0^1 h_n(t)\,dt = 0$ and $|h_n(t)| \le 1$, the function

$$f_n(t) = (1 + h_n(t))\,1_{\{t \in [0,1]\}}$$

is a probability density function. Furthermore, the proof of the Riemann-Lebesgue
lemma (for example [31], Thm. 12.21) can be used almost verbatim to show that

$$\lim_{n \to \infty} \int_0^1 h_n(t) g(t)\,dt = 0 \quad \text{for all } g \in L_1([0,1],\mathbb{R})$$

and therefore

$$\lim_{n \to \infty} \int_0^1 f_n(t) g(t)\,dt = \int_0^1 g(t)\,dt \quad \text{for all } g \in L_1([0,1],\mathbb{R}). \tag{3.5}$$

In particular, we obtain that the sequence of probability measures induced by the
sequence $\{f_n\}$ converges setwise to $U([0,1])$.
Now, for every $n$, define a channel as

$$Q_n(\,\cdot\,|x) = \begin{cases} U([0,1]), & x = 0, \\ P_{f_n}, & x = 1, \end{cases}$$

where $P_{f_n}$ denotes the probability measure with density $f_n$.
Then $Q_n(\,\cdot\,|x) \to Q(\,\cdot\,|x)$ setwise for $x = 0$ and $x = 1$, and thus $PQ_n \to PQ$ setwise.
However, letting $(X,Y_n) \sim PQ_n$, a simple calculation shows that the optimal policy
for $PQ_n$ is

$$\gamma_n(y) = E[X|Y_n = y] = \begin{cases} 0, & y \in \bigcup_{k=1}^n R_{nk}, \\ \frac{2}{3}, & y \in \bigcup_{k=1}^n L_{nk}, \end{cases}$$

and therefore for every $n \in \mathbb{N}$

$$J(P,Q_n) = \min_{\gamma \in \mathcal{G}} E[(X - \gamma(Y_n))^2]
= \frac{1}{2}\int_0^1 (0 - \gamma_n(y))^2\,dy + \frac{1}{2}\int_0^1 (1 - \gamma_n(y))^2 f_n(y)\,dy = \frac{1}{6}.$$

Thus, the optimal cost value is not continuous under setwise convergence. □
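The values $1/4$ and $1/6$ can be verified with exact rational arithmetic. The sketch below assumes the square-wave density takes the value 2 on the even-indexed pieces of length $1/(2n)$ (the sets $L_{nk}$) and 0 on the odd-indexed pieces (the sets $R_{nk}$); the piece indexing and the function names are illustrative assumptions.

```python
from fractions import Fraction as F

def cost_Q():
    """Optimal cost under Q = U([0,1]): Y is uninformative, so gamma = E[X] = 1/2."""
    half = F(1, 2)
    return half * (0 - half) ** 2 + half * (1 - half) ** 2

def cost_Qn(n):
    """Optimal cost under Q_n, integrating piece by piece over [0, 1]."""
    half, piece = F(1, 2), F(1, 2 * n)
    total = F(0)
    for k in range(2 * n):
        f = F(2) if k % 2 == 0 else F(0)        # assumed value of f_n on piece k
        gamma = (half * f) / (half + half * f)  # posterior mean E[X | Y_n = y]
        total += piece * (half * (0 - gamma) ** 2 + half * f * (1 - gamma) ** 2)
    return total

def mass_fn(n, t):
    """Mass assigned to [0, t] by the density f_n, for 0 <= t <= 1."""
    t, piece = F(t), F(1, 2 * n)
    total = F(0)
    for k in range(0, 2 * n, 2):                # only even pieces carry density 2
        lo = k * piece
        total += 2 * max(F(0), min(t, lo + piece) - lo)
    return total

print(cost_Q(), [cost_Qn(n) for n in (1, 5, 50)])
print([abs(mass_fn(n, F(1, 3)) - F(1, 3)) for n in (1, 10, 100)])
```

The cost under $Q_n$ equals $1/6$ for every $n$, while the mass that $f_n$ assigns to an interval approaches its Lebesgue measure, which is the behavior expected under setwise convergence to $U([0,1])$.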

3.2.2. Upper semi-continuity under setwise convergence. 

Theorem 3.3. Under assumption A2 the optimal cost

$$J(P,Q) := \inf_{\gamma \in \mathcal{G}} E_P^{Q,\gamma}[c(X,U)]$$

is sequentially upper semi-continuous on the set of communication channels $\mathcal{Q}$ under
setwise convergence.

Proof. Let $\{Q_n\}$ converge setwise to $Q$ at input $P$. Then

$$\begin{aligned}
\limsup_{n \to \infty} \inf_{\gamma \in \mathcal{G}} \int c(x,\gamma(y))\,PQ_n(dx,dy)
&\le \inf_{\gamma \in \mathcal{G}} \limsup_{n \to \infty} \int c(x,\gamma(y))\,PQ_n(dx,dy) \\
&= \inf_{\gamma \in \mathcal{G}} \int c(x,\gamma(y))\,PQ(dx,dy),
\end{aligned}$$

where the equality holds since $c$ is bounded. □
3.3. Continuity under total variation. 

Theorem 3.4. Under assumption A2 the optimal cost $J(P,Q)$ is continuous
on the set of communication channels $\mathcal{Q}$ under the topology of total variation.

Proof. Assume $Q_n \to Q$ in total variation at input $P$. Let $\epsilon > 0$ and pick
the $\epsilon$-optimal policies $\gamma_n$ and $\gamma$ under channels $Q_n$ and $Q$, respectively. That is,
letting $J(Q',\gamma') = E_P^{Q',\gamma'}[c(X,U)]$ for any $\gamma' \in \mathcal{G}$ and $Q' \in \mathcal{Q}$, we have $J(Q_n,\gamma_n) <
J(P,Q_n) + \epsilon$ and $J(Q,\gamma) < J(P,Q) + \epsilon$.

Considering first the case $J(P,Q_n) < J(P,Q)$, we have

$$\begin{aligned}
J(P,Q) - J(P,Q_n) &\le J(P,Q) - J(Q_n,\gamma_n) + \epsilon \\
&\le J(Q,\gamma_n) - J(Q_n,\gamma_n) + \epsilon.
\end{aligned}$$

By a symmetric argument it follows that

$$|J(P,Q) - J(P,Q_n)| \le \max\bigl(J(Q,\gamma_n) - J(Q_n,\gamma_n),\ J(Q_n,\gamma) - J(Q,\gamma)\bigr) + \epsilon. \tag{3.6}$$

Now, since $c$ is bounded, it follows from (2.1) that for any $\gamma' \in \mathcal{G}$,

$$\begin{aligned}
|J(Q_n,\gamma') - J(Q,\gamma')| &= \Bigl| \int c(x,\gamma'(y))\,PQ_n(dx,dy) - \int c(x,\gamma'(y))\,PQ(dx,dy) \Bigr| \\
&\le \|c\|_\infty \|PQ_n - PQ\|_{TV}.
\end{aligned}$$

This and (3.6) imply $|J(P,Q_n) - J(P,Q)| \le \|c\|_\infty \|PQ_n - PQ\|_{TV} + \epsilon$. Since $\epsilon > 0$
was arbitrary, we obtain $|J(P,Q_n) - J(P,Q)| \le \|c\|_\infty \|PQ_n - PQ\|_{TV}$. Since
$\|PQ_n - PQ\|_{TV} \to 0$ by Lemma 2.2, we obtain $J(P,Q_n) \to J(P,Q)$ as claimed. □
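The Lipschitz-type bound obtained in the proof can be checked by brute force on a finite example. The sketch below is illustrative only: it assumes binary $\mathbb{X} = \mathbb{Y} = \mathbb{U} = \{0,1\}$, the Hamming cost $c(x,u) = 1_{\{x \ne u\}}$ (so $\|c\|_\infty = 1$), and binary symmetric channels with crossover probability $0.1 + 1/n$ converging to crossover probability $0.1$. None of these choices or helper names come from the paper.

```python
from itertools import product

def joint(P, Q):
    """Joint distribution PQ on pairs (x, y) for a finite input law P and channel Q."""
    return {(x, y): P[x] * Q[x][y] for x in P for y in Q[x]}

def tv(mu, nu):
    """Total variation distance with the factor-2 convention of (2.1)."""
    return sum(abs(mu[k] - nu[k]) for k in mu)

def optimal_cost(P, Q, c, actions):
    """J(P, Q): minimum expected cost over all maps gamma from outputs to actions."""
    pq = joint(P, Q)
    ys = sorted({y for (_, y) in pq})
    best = float("inf")
    for choice in product(actions, repeat=len(ys)):
        gamma = dict(zip(ys, choice))
        best = min(best, sum(m * c(x, gamma[y]) for (x, y), m in pq.items()))
    return best

def bsc(eps):
    """Binary symmetric channel with crossover probability eps (assumed channel family)."""
    return {0: {0: 1 - eps, 1: eps}, 1: {0: eps, 1: 1 - eps}}

def hamming(x, u):
    return 1.0 if x != u else 0.0

P = {0: 0.3, 1: 0.7}
Q = bsc(0.1)
for n in (5, 10, 100):
    Qn = bsc(0.1 + 1.0 / n)
    gap = abs(optimal_cost(P, Qn, hamming, (0, 1)) - optimal_cost(P, Q, hamming, (0, 1)))
    bound = 1.0 * tv(joint(P, Qn), joint(P, Q))  # ||c||_inf * ||PQ_n - PQ||_TV
    print(n, gap, bound)
```

In each row the gap in optimal costs stays below the total variation bound, and both shrink as the channels converge.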

4. Problem P2: Existence of optimal channels. Here we study characteri- 
zations of compactness which will be useful in obtaining existence results. 

The discussion on weak convergence showed us that weak convergence does not
induce a strong enough topology, i.e., one under which useful continuity properties can be
obtained. In the following, we will obtain conditions for compactness for the other two
convergence notions, that is, for setwise convergence and total variation. We note that
in the topologies induced by these three modes of convergence, notions of compactness
and sequential compactness coincide (for total variation and weak convergence this
follows from metrizability; for setwise convergence see [6, Thm. 4.7.25]).

We first discuss setwise convergence. A set of probability measures $\mathcal{M}$ on some
measurable space is said to be setwise precompact if every sequence in $\mathcal{M}$ has a
subsequence converging setwise to a probability measure (not necessarily in $\mathcal{M}$). For
two finite measures $\nu$ and $\mu$ defined on the same measurable space we write $\nu \le \mu$ if
$\nu(A) \le \mu(A)$ for all measurable $A$.

We have the following condition for setwise (pre)compactness:

Lemma 4.1 ([2, Thm. 4.7.25]). Let $\mu$ be a finite measure on a measurable space
$(T, \mathcal{A})$. Assume a set of probability measures $\Psi \subset \mathcal{P}(T)$ satisfies

$$P \le \mu \quad \text{for all } P \in \Psi.$$

Then $\Psi$ is setwise precompact.

As before, $PQ \in \mathcal{P}(\mathbb{X} \times \mathbb{Y})$ denotes the joint probability measure induced by input
$P$ and channel $Q$, where $\mathbb{X} = \mathbb{R}^n$ and $\mathbb{Y} = \mathbb{R}^m$. A simple consequence of the preceding
majorization criterion is the following.

Lemma 4.2. Let $\nu$ be a finite measure on $\mathcal{B}(\mathbb{X} \times \mathbb{Y})$ and let $P$ be a probability
measure on $\mathcal{B}(\mathbb{X})$. Suppose $\mathcal{Q}$ is a set of channels such that

$$PQ \le \nu \quad \text{for all } Q \in \mathcal{Q}.$$

Then $\mathcal{Q}$ is setwise precompact at input $P$ in the sense that any sequence in $\mathcal{Q}$ has a
subsequence $\{Q_n\}$ such that $Q_n \to Q$ setwise at input $P$ for some channel $Q$.

Proof. By Lemma 4.1 the set of joint measures $\mathcal{M} = \{PQ : Q \in \mathcal{Q}\}$ is setwise
precompact, that is, any sequence in $\mathcal{M}$ has a subsequence $\{PQ_n\}$ converging to
some $\hat{P}$ setwise. Furthermore, since the first marginal of $PQ_n$ is $P$ for all $n$, the first
marginal of $\hat{P}$ is also $P$ (since $PQ_n(A \times \mathbb{Y}) \to \hat{P}(A \times \mathbb{Y})$ for all $A \in \mathcal{B}(\mathbb{X})$). Now let
$Q$ be a regular conditional probability measure satisfying $\hat{P} = PQ$. □

For a probability density function $p$ on $\mathbb{R}^N$ we let $P_p$ denote the induced
probability measure: $P_p(A) = \int_A p(x)\,dx$, $A \in \mathcal{B}(\mathbb{R}^N)$. The next lemma gives a sufficient
condition for precompactness under total variation.

Lemma 4.3. Let $\mu$ be a finite Borel measure on $\mathbb{R}^N$ and let $\mathcal{F}$ be an equicontinuous
and uniformly bounded family of probability density functions. Define $\Psi \subset \mathcal{P}(\mathbb{R}^N)$ by

$$\Psi = \{P_p : P_p \le \mu,\ p \in \mathcal{F}\}.$$

Then $\Psi$ is precompact under total variation.

Proof. By Lemma 4.1, $\Psi$ is setwise precompact and thus any sequence in $\Psi$ has
a subsequence $\{P_n\}$ such that $P_n \to P$ setwise for some $P \in \mathcal{P}(\mathbb{R}^N)$. $P$ is clearly
absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^N$, and so it admits
a density $p$.

Let $p_n$ be the density of $P_n$. It suffices to show that

$$\lim_{n \to \infty} \|p_n - p\|_1 = 0 \tag{4.1}$$

since $\|P_n - P\|_{TV} = \|p_n - p\|_1 = \int |p_n(x) - p(x)|\,dx$.

Pick a sequence of compact sets $K_j \subset \mathbb{R}^N$ such that $K_j \subset K_{j+1}$ for all $j \in \mathbb{N}$,
and $\bigcup_j K_j = \mathbb{R}^N$. Since the collection of densities $\{p_n\}$ is uniformly bounded and
equicontinuous, it is precompact in the supremum norm on each $K_j$ by the
Arzelà-Ascoli theorem [13]. Thus there exist subsequences $\{p_{n_k^j}\}$ such that

$$\lim_{k \to \infty} \sup_{x \in K_j} |p_{n_k^j}(x) - \hat{p}^j(x)| = 0$$

for some continuous $\hat{p}^j : K_j \to [0,\infty)$.

Since the $K_j$ are nested, one can choose $\{p_{n_k^{j+1}}\}$ to be a subsequence of $\{p_{n_k^j}\}$
for all $j \in \mathbb{N}$. Then $\hat{p}^{j+1}$ coincides with $\hat{p}^j$ on $K_j$ and we can define $\hat{p}$ on $\mathbb{R}^N$ by
setting $\hat{p}(x) = \hat{p}^j(x)$, $x \in K_j$. We can now use Cantor's diagonal method to pick an
increasing sequence of integers $\{m_i\}$ which is a subsequence of each $\{n_k^j\}$, and thus

$$\lim_{i \to \infty} p_{m_i}(x) = \hat{p}(x) \quad \text{for all } x \in \mathbb{R}^N. \tag{4.2}$$

Note that by construction the convergence is uniform on each $K_j$ (and $\hat{p}$ is continuous).
By uniform convergence $P_{p_{m_i}}(A) \to P_{\hat{p}}(A)$ for all Borel subsets $A$ of $K_j$. The setwise
convergence of $P_n$ to $P_p$ implies $P_{p_{m_i}}(A) \to P_p(A)$ for all Borel sets, so we must have
$p = \hat{p}$ almost everywhere. This and (4.2) imply via Scheffé's theorem [5] that

$$\|p_{m_i} - p\|_1 \to 0,$$

which completes the proof. □

The next result is an analogue of Lemma 4.2 and has an essentially identical
proof.

Lemma 4.4. Let $\mathcal{Q}$ be a set of channels such that $\{PQ : Q \in \mathcal{Q}\}$ is a precompact
set of probability measures under total variation. Then $\mathcal{Q}$ is precompact under total
variation at input $P$.

The following theorem, when combined with the preceding results, gives sufficient 
conditions for the existence of best and worst channels when the given family of 
channels Q is closed under the appropriate convergence notion. 

Theorem 4.5. Recall problem P2.

(i) There exists a worst channel in $\mathcal{Q}$, that is, a solution for the maximization
problem

$$\sup_{Q \in \mathcal{Q}} J(P,Q) = \sup_{Q \in \mathcal{Q}} \inf_{\gamma} E_P^{Q,\gamma}[c(X,U)],$$

when the set $\mathcal{Q}$ is weakly compact and assumptions A1, A4, and A5 hold.

(ii) There exists a worst channel in $\mathcal{Q}$ when the set $\mathcal{Q}$ is setwise compact and
assumption A2 holds.

(iii) There exist best and worst channels in $\mathcal{Q}$, that is, solutions for the minimization
problem $\inf_{Q \in \mathcal{Q}} J(P,Q)$ and the maximization problem $\sup_{Q \in \mathcal{Q}} J(P,Q)$, when the
set $\mathcal{Q}$ is compact under total variation and assumption A2 holds.

Proof. Under the stated conditions, we have upper semi-continuity or continuity
(Theorems 3.2, 3.3, and 3.4) under the corresponding topologies. By compactness,
the existence of the cost maximizing (worst) channel follows when $J(P,Q)$ is upper
semi-continuous, while the existence of the cost minimizing (best) channel follows when
$J(P,Q)$ is continuous in $Q$. □
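
As a small numerical illustration of Theorem 4.5 (a sketch only, not part of the original argument), the following Python code evaluates $J(P,Q)$ on finite state, observation, and action spaces and selects the best and worst channels from a finite, hence trivially compact, family. The binary symmetric channels, the Hamming cost, and the names `optimal_cost` and `bsc` are assumptions made for this example.

```python
import numpy as np

def optimal_cost(P, Q, c):
    """J(P, Q) = sum_y min_u sum_x P(x) Q(y|x) c(x, u) on finite spaces.

    P: prior on X, shape (K,); Q: channel, shape (K, L), rows sum to 1;
    c: cost table, shape (K, A), with c[x, u] = c(x, u).
    """
    joint = P[:, None] * Q        # joint[x, y] = P(x) Q(y|x)
    per_action = joint.T @ c      # per_action[y, u] = sum_x joint[x, y] c(x, u)
    return float(per_action.min(axis=1).sum())

def bsc(eps):
    """Binary symmetric channel with crossover probability eps (illustrative)."""
    return np.array([[1.0 - eps, eps], [eps, 1.0 - eps]])

P = np.array([0.5, 0.5])
c = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hamming cost c(x, u) = 1{x != u}
family = {eps: bsc(eps) for eps in (0.0, 0.1, 0.5)}
costs = {eps: optimal_cost(P, Q, c) for eps, Q in family.items()}
best = min(costs, key=costs.get)    # channel minimizing J(P, Q)
worst = max(costs, key=costs.get)   # channel maximizing J(P, Q)
```

In this toy family the noiseless channel is the best one and the channel with crossover probability 1/2, which carries no information about the state, is the worst.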

Remark. The existence of worst channels is useful for the robust control or game- 
theoretic approach to optimization problems. If the problem is formulated as a game 
where the uncertainty in the set is regarded as a maximizer and the controller is the 
minimizer, one could search for a max-min solution, which we prove to exist. One 
could also look for min-max solutions, which we leave as a topic for future
research. We note that, in information theory, problems of similar nature have been
considered in the context of mutual information games [5]. 

5. Application: quantizers as a class of channels. Here we consider the 
problem of convergence and optimization of quantizers. We start with the definition 
of a quantizer. 

Definition 5.1. An $M$-cell vector quantizer, $q$, is a (Borel) measurable mapping
from $\mathsf{X} = \mathbb{R}^n$ to the finite set $\{1, 2, \dots, M\}$, characterized by a measurable partition
$\{B_1, B_2, \dots, B_M\}$ such that $B_i = \{x : q(x) = i\}$ for $i = 1, \dots, M$. The $B_i$ are called
the cells (or bins) of $q$.

Remarks. 

(i) For later convenience we allow for the possibility that some of the cells of 
the quantizer are empty. 

(ii) Traditionally, in source coding theory, a quantizer is a mapping $q : \mathbb{R}^n \to \mathbb{R}^n$
with a finite range. Thus $q$ is defined by a partition and a reconstruction value in $\mathbb{R}^n$
for each cell in the partition. That is, for given cells $\{B_1, \dots, B_M\}$ and reconstruction
values $\{c_1, \dots, c_M\} \subset \mathbb{R}^n$, we have $q(x) = c_i$ if and only if $x \in B_i$. In our definition,
we do not include the reconstruction values.

A quantizer $q$ with cells $\{B_1, \dots, B_M\}$, however, can also be characterized as a
stochastic kernel $Q$ from $\mathsf{X}$ to $\{1, \dots, M\}$ defined by

$$Q(i|x) = 1_{\{x \in B_i\}}, \quad i = 1, \dots, M,$$

so that $q(x) = \sum_{i=1}^M i\, Q(i|x)$. We denote by $\mathcal{Q}_D(M)$ the space of all $M$-cell quantizers
represented in the channel form. In addition, we let $\mathcal{Q}(M)$ denote the set of (Borel)
stochastic kernels from $\mathsf{X}$ to $\{1, \dots, M\}$, i.e., $Q \in \mathcal{Q}(M)$ if and only if $Q(\,\cdot\,|x)$ is a
probability distribution on $\{1, \dots, M\}$ for all $x \in \mathsf{X}$, and $Q(i|\,\cdot\,)$ is Borel measurable
for all $i = 1, \dots, M$. Note that $\mathcal{Q}_D(M) \subset \mathcal{Q}(M)$, and by our definition $\mathcal{Q}_D(M-1) \subset
\mathcal{Q}_D(M)$ for all $M \ge 2$. We note that elements of $\mathcal{Q}(M)$ are sometimes referred to in
the literature as random quantizers.
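
The channel form of a quantizer can be made concrete with a short Python sketch. It assumes a scalar quantizer whose cells are intervals given by sorted thresholds; the function name `quantizer_kernel` and the sample points are hypothetical and serve only as an illustration.

```python
import numpy as np

def quantizer_kernel(x, thresholds):
    """Kernel matrix Q[j, i-1] = Q(i | x_j) of a scalar quantizer with interval cells.

    With sorted thresholds t_1 < ... < t_{M-1}, cell 1 is (-inf, t_1),
    cell i is [t_{i-1}, t_i), and cell M is [t_{M-1}, inf).
    """
    x = np.asarray(x, dtype=float)
    cell = np.searchsorted(thresholds, x, side="right")   # 0-based cell index
    Q = np.zeros((x.size, len(thresholds) + 1))
    Q[np.arange(x.size), cell] = 1.0                      # indicator 1{x in B_i}
    return Q

x = np.array([-1.0, 0.2, 0.7, 3.0])
Q = quantizer_kernel(x, thresholds=[0.0, 0.5, 1.0])
labels = Q @ np.arange(1, Q.shape[1] + 1)   # q(x) = sum_i i * Q(i|x)
```

Each row of `Q` is a degenerate probability vector, which is exactly what distinguishes a deterministic quantizer from a general element of $\mathcal{Q}(M)$.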

Lemma 5.2. The set of quantizers $\mathcal{Q}_D(M)$ is setwise precompact at any input $P$.

Proof. The proof follows from Lemma 4.2 and the interpretation above regarding
a quantizer as a channel. In particular, a majorizing finite measure $\nu$ is obtained
by defining $\nu = P \times \lambda$, where $\lambda$ is the counting measure on $\{1, \dots, M\}$ (note that
$\nu(\mathbb{R}^n \times \{1, \dots, M\}) = M$). Then for any measurable $B \subset \mathbb{R}^n$ and $i = 1, \dots, M$, we
have $\nu(B \times \{i\}) = P(B)\lambda(\{i\}) = P(B)$ and so

$$PQ(B \times \{i\}) = P(B \cap B_i) \le P(B) = \nu(B \times \{i\}).$$

Since any measurable $D \subset \mathsf{X} \times \{1, \dots, M\}$ can be written as the disjoint union of
the sets $D_i \times \{i\}$, $i = 1, \dots, M$, with $D_i = \{x \in \mathsf{X} : (x, i) \in D\}$, the above implies
$PQ(D) \le \nu(D)$. □

The following simple lemma provides a useful formula. 

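Lemma 5.3. A sequence $\{Q_n\}$ in $\mathcal{Q}(M)$ converges to a $Q$ in $\mathcal{Q}(M)$ setwise at
input $P$ if and only if

$$\int_A Q_n(i|x) P(dx) \to \int_A Q(i|x) P(dx) \quad \text{for all } A \in \mathcal{B}(\mathsf{X}) \text{ and } i = 1, \dots, M.$$

Proof. The lemma follows by noticing that for any $Q \in \mathcal{Q}(M)$ and measurable
$D \subset \mathsf{X} \times \{1, \dots, M\}$,

$$PQ(D) = \int_D Q(dy|x) P(dx) = \sum_{i=1}^M \int_{D_i} Q(i|x) P(dx),$$

where $D_i = \{x \in \mathsf{X} : (x, i) \in D\}$. □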

The following counterexample shows that the space of quantizers $\mathcal{Q}_D(M)$ is not
closed under setwise convergence:

Let $\mathsf{X} = [0,1]$ and $P$ the uniform distribution on $[0,1]$. Recall the intervals
$L_{nk} = \left[\frac{2k-2}{2n}, \frac{2k-1}{2n}\right)$ and the functions $f_n$ defined earlier, and let $B_{n,1} = \bigcup_{k=1}^n L_{nk}$ and
$B_{n,2} = [0,1] \setminus B_{n,1}$. Define $\{Q_n\}$ as the sequence of 2-cell quantizers given by

$$Q_n(1|x) = 1_{\{x \in B_{n,1}\}}, \quad Q_n(2|x) = 1_{\{x \in B_{n,2}\}}.$$

Then the limit established earlier for $f_n$ implies that for all $A \in \mathcal{B}([0,1])$,

$$\lim_{n\to\infty} \int_A Q_n(1|x) P(dx) = \lim_{n\to\infty} \int_A f_n(t)\,dt = \frac{1}{2} P(A),$$

and thus, by Lemma 5.3, $Q_n$ converges setwise to $Q$ given by $Q(1|x) = Q(2|x) = \frac{1}{2}$
for all $x \in [0,1]$. However, $Q$ is not a (deterministic) quantizer. □
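
The averaging effect behind this counterexample can be checked numerically. The sketch below assumes that the first cell of the $n$-th quantizer is the union over $k = 1, \dots, n$ of the intervals $[(2k-2)/(2n), (2k-1)/(2n))$ and that $P$ is uniform on $[0,1]$; the function name `mass_in_first_cell` and the test interval are illustrative choices.

```python
def mass_in_first_cell(a, b, n):
    """Lebesgue measure of [a, b) intersected with the first cell B_{n,1}.

    Assumes B_{n,1} is the union over k = 1..n of [(2k-2)/(2n), (2k-1)/(2n)).
    """
    total = 0.0
    for k in range(1, n + 1):
        lo, hi = (2 * k - 2) / (2 * n), (2 * k - 1) / (2 * n)
        total += max(0.0, min(b, hi) - max(a, lo))   # overlap of [a, b) with L_nk
    return total

a, b = 0.1, 0.7
values = [mass_in_first_cell(a, b, n) for n in (1, 10, 100, 1000)]
# The values approach (b - a) / 2 = 0.3 even though every Q_n is deterministic.
```

For $A = [0.1, 0.7)$ the integral of $Q_n(1|x)$ over $A$ tends to $P(A)/2$, which is the mass assigned by the randomized limit channel rather than by any 2-cell quantizer.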

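Definition 5.4. The class of finitely randomized quantizers $\mathcal{Q}_{FR}(M)$ is the convex
hull of $\mathcal{Q}_D(M)$, i.e., $Q \in \mathcal{Q}_{FR}(M)$ if and only if there exist $k \in \mathbb{N}$, $Q_1, \dots, Q_k \in
\mathcal{Q}_D(M)$, and $\alpha_1, \dots, \alpha_k \in [0,1]$ with $\sum_{j=1}^k \alpha_j = 1$, such that

$$Q(i|x) = \sum_{j=1}^k \alpha_j Q_j(i|x) \quad \text{for all } i = 1, \dots, M \text{ and } x \in \mathsf{X}.$$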

The next result shows that $\mathcal{Q}(M)$ is the closure of the convex hull of $\mathcal{Q}_D(M)$.

Theorem 5.5. For any $Q \in \mathcal{Q}(M)$ there exists a sequence $\{Q_n\}$ of finitely
randomized quantizers in $\mathcal{Q}_{FR}(M)$ which converges to $Q$ setwise at any input $P$.

Proof. We will prove the existence of a sequence $\{Q_n\}$ in $\mathcal{Q}_{FR}(M)$ such that
$Q_n(\,\cdot\,|x) \to Q(\,\cdot\,|x)$ setwise for all $x \in \mathsf{X}$.

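Let $\mathcal{P}_M = \{z \in \mathbb{R}^M : z_1 + \dots + z_M = 1,\ z_i \ge 0,\ i = 1, \dots, M\}$ denote the
probability simplex in $\mathbb{R}^M$ and note that each $Q \in \mathcal{Q}(M)$ is uniquely represented by
the function $Q^v : \mathsf{X} \to \mathcal{P}_M$ defined by

$$Q^v(x) = (Q(1|x), Q(2|x), \dots, Q(M|x)).$$

For a positive integer $n$ let $\mathcal{P}_{M,n}$ be the collection of probability vectors in $\mathcal{P}_M$ with
rational components having common denominator $n$, i.e.,

$$\mathcal{P}_{M,n} = \{z \in \mathcal{P}_M : z_i \in \{0, 1/n, \dots, (n-1)/n, 1\},\ i = 1, \dots, M\}.$$

Clearly, any $z \in \mathcal{P}_M$ can be approximated within error $1/n$ in the $l_\infty$ sense by a
member of $\mathcal{P}_{M,n}$, i.e.,

$$\max_{z \in \mathcal{P}_M} \min_{z' \in \mathcal{P}_{M,n}} \|z - z'\|_\infty = \max_{z \in \mathcal{P}_M} \min_{z' \in \mathcal{P}_{M,n}} \max_{i = 1, \dots, M} |z_i - z'_i| \le \frac{1}{n}.$$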

Breaking ties in a predetermined manner, we can make the selection of $z'$ for a given
$z$ unique, and thus define a Borel measurable mapping $q_n : \mathcal{P}_M \to \mathcal{P}_{M,n}$ such that
$z' = q_n(z)$ approximates $z$ in the above sense. Given $Q \in \mathcal{Q}(M)$, use this mapping
to define $\hat{Q}_n \in \mathcal{Q}(M)$ through the relation

$$\hat{Q}_n^v(x) = q_n(Q^v(x)).$$

(The measurability of $\hat{Q}_n(i|x)$ in $x$ follows from the measurability of the mapping $q_n$.)
Let $\{z^{(1)}, \dots, z^{(L(n))}\}$ be an enumeration of those elements of $\mathcal{P}_{M,n}$ for which the sets

$$S_j = \{x : \hat{Q}_n^v(x) = z^{(j)}\}, \quad j = 1, \dots, L(n),$$

are not empty (clearly, $L(n) \le (n+1)^M$). Note that the $S_j$ form a Borel-measurable
partition of $\mathsf{X}$ and we have

$$u := (z^{(1)}, z^{(2)}, \dots, z^{(L(n))}) \in (\mathcal{P}_M)^{L(n)}$$

and

$$\hat{Q}_n^v(x) = z^{(j)} \quad \text{if } x \in S_j.$$

Viewed as a subset of $\mathbb{R}^{M \cdot L(n)}$, the set $(\mathcal{P}_M)^{L(n)}$ is compact and convex and therefore
by the Krein-Milman theorem (see, e.g., [3]) it is the closure of the convex hull of
its extreme points. The set of extreme points of $(\mathcal{P}_M)^{L(n)}$ is $(\mathcal{E}_M)^{L(n)}$, where $\mathcal{E}_M =
\{e_1, \dots, e_M\}$ is the standard basis for $\mathbb{R}^M$. In particular, we can find $u_1, \dots, u_N \in
(\mathcal{E}_M)^{L(n)}$ and $(\alpha_1, \dots, \alpha_N) \in \mathcal{P}_N$ such that $\|u - \sum_{k=1}^N \alpha_k u_k\| \le \frac{1}{n}$ ($\|\cdot\|$ denotes the
standard Euclidean norm in any dimension). Since $u_k = (u_{k,1}, \dots, u_{k,L(n)})$, where
$u_{k,j} \in \mathcal{E}_M$ for all $k$ and $j$, we can define the deterministic quantizers $Q_{n,k} \in \mathcal{Q}_D(M)$,
$k = 1, \dots, N$, by setting

$$Q_{n,k}^v(x) = u_{k,j} \quad \text{if } x \in S_j.$$

Putting things together, we obtain that

$$\left\| \hat{Q}_n^v(x) - \sum_{k=1}^N \alpha_k Q_{n,k}^v(x) \right\| \le \frac{1}{n} \quad \text{for all } x \in \mathsf{X}. \tag{5.1}$$

Define $Q_n \in \mathcal{Q}(M)$ by

$$Q_n(i|x) = \sum_{k=1}^N \alpha_k Q_{n,k}(i|x).$$

Combining (5.1) with $\|Q^v(x) - \hat{Q}_n^v(x)\|_\infty \le \frac{1}{n}$ we obtain

$$|Q(i|x) - Q_n(i|x)| \le \frac{2}{n} \quad \text{for all } x \in \mathsf{X} \text{ and } i = 1, \dots, M,$$

which implies that $Q_n(\,\cdot\,|x) \to Q(\,\cdot\,|x)$ setwise for all $x \in \mathsf{X}$. Since each $Q_n$ is a convex
combination of deterministic quantizers in $\mathcal{Q}_D(M)$, the proof is complete. □
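
The rounding step of the proof can be sketched in Python at a single point $x$. The code below is an illustration under stated assumptions, not the construction used in the proof: it implements one possible choice of the map $q_n$ (largest-remainder rounding) and, instead of invoking the Krein-Milman theorem, it expands the rounded vector into $n$ equally weighted labels, which reproduces the grid point exactly. The names `round_to_grid` and `split_into_labels` are hypothetical.

```python
import numpy as np

def round_to_grid(z, n):
    """Round a probability vector z to a grid point with entries in {0, 1/n, ..., 1}.

    Largest-remainder rounding: take floor(n * z), then hand the leftover
    units to the entries with the largest fractional parts. Returns integer
    counts summing to n; the grid point is counts / n.
    """
    z = np.asarray(z, dtype=float)
    scaled = n * z
    counts = np.floor(scaled).astype(int)
    leftover = n - counts.sum()
    order = np.argsort(-(scaled - counts), kind="stable")
    counts[order[:leftover]] += 1
    return counts

def split_into_labels(counts):
    """Expand counts into n labels in {1, ..., M}; label i appears counts[i-1] times."""
    return np.repeat(np.arange(1, len(counts) + 1), counts)

z = np.array([0.62, 0.25, 0.13])     # the vector Q^v(x) at one fixed x
n = 10
counts = round_to_grid(z, n)          # grid point is counts / n
labels = split_into_labels(counts)    # labels[k-1] = output of the k-th quantizer at x
average = np.bincount(labels, minlength=len(z) + 1)[1:] / n
```

Averaging the $n$ deterministic outputs with weights $1/n$ recovers the rounded vector, and each coordinate of the rounded vector is within $1/n$ of the corresponding coordinate of `z`.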

The preceding theorem has important consequences in that it tells us that the 
space of deterministic quantizers is a "basis" for the space of communication channels 
between X and {1, . . . , M} in an appropriate sense. In the following we show that 
an optimal channel can be replaced with an optimal quantizer without any loss in 
performance. 
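
As a numerical illustration of this claim (a sketch on a finite state space, with a squared-error cost, a fixed policy, and a randomly generated channel, all of which are assumptions of the example), the following Python code compares the expected cost of a randomized channel with that of the deterministic quantizer that sends each state to a cost-minimizing index. The function names are hypothetical.

```python
import numpy as np

def expected_cost(P, Q, cost):
    """E[c(X, gamma(Y))] for prior P (K,), kernel Q (K, M), cost[x, i] = c(x, gamma(i))."""
    return float(np.sum(P[:, None] * Q * cost))

def nearest_cell_quantizer(cost):
    """Deterministic kernel sending x to the smallest index i minimizing cost[x, i]."""
    K, M = cost.shape
    Qd = np.zeros((K, M))
    Qd[np.arange(K), np.argmin(cost, axis=1)] = 1.0
    return Qd

rng = np.random.default_rng(0)
K, M = 50, 4
x = rng.uniform(0.0, 1.0, size=K)            # finite state space (illustrative)
P = np.full(K, 1.0 / K)                      # uniform prior on the K states
gamma = np.array([0.1, 0.4, 0.6, 0.9])       # a fixed policy gamma(i)
cost = (x[:, None] - gamma[None, :]) ** 2    # c(x, gamma(i)) = (x - gamma(i))^2
Q = rng.dirichlet(np.ones(M), size=K)        # an arbitrary randomized channel
Qd = nearest_cell_quantizer(cost)            # deterministic replacement
cost_random = expected_cost(P, Q, cost)
cost_quantizer = expected_cost(P, Qd, cost)
```

For every state the quantizer puts all mass on an index with minimal cost, so `cost_quantizer` cannot exceed `cost_random` for the same policy.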

Proposition 5.6. For any $Q \in \mathcal{Q}(M)$ there is a $Q' \in \mathcal{Q}_D(M)$ with $J(P,Q') \le
J(P,Q)$. If there exists an optimal channel in $\mathcal{Q}(M)$ for problem P2, then there is a
quantizer in $\mathcal{Q}_D(M)$ that is optimal.

Proof. Only the first statement needs to be proved. We follow an argument
common in the source coding literature (see, e.g., the Appendix of [33]).

For a policy $\gamma : \{1, \dots, M\} \to \mathsf{U} = \mathsf{X}$ (with finite cost) define for all $i$,

$$\bar{B}_i = \{x : c(x, \gamma(i)) \le c(x, \gamma(j)),\ j = 1, \dots, M\}.$$

Letting $B_1 = \bar{B}_1$ and $B_i = \bar{B}_i \setminus \bigcup_{j=1}^{i-1} B_j$, $i = 2, \dots, M$, we obtain a partition
$\{B_1, \dots, B_M\}$ and a corresponding quantizer $Q' \in \mathcal{Q}_D(M)$. It is easy to see that
$E_P^{Q',\gamma}[c(X,U)] \le E_P^{Q,\gamma}[c(X,U)]$ for any $Q \in \mathcal{Q}(M)$. □

The following shows that setwise convergence of quantizers implies convergence
under total variation.

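Theorem 5.7. Let $\{Q_n\}$ be a sequence of quantizers in $\mathcal{Q}_D(M)$ which converges
to a quantizer $Q \in \mathcal{Q}_D(M)$ setwise at $P$. Then the convergence is also under total
variation at $P$.

Proof. Let $B_1^n, \dots, B_M^n$ be the cells of $Q_n$ and let $B_1, \dots, B_M$ be the cells of $Q$. Since
$Q_n \to Q$ setwise at input $P$, we have $PQ_n(B \times \{i\}) \to PQ(B \times \{i\})$ for any $B \in \mathcal{B}(\mathsf{X})$.
Since $PQ_n(B \times \{i\}) = \int_B 1_{\{x \in B_i^n\}} P(dx)$, we obtain

$$P(B \cap B_i^n) \to P(B \cap B_i) \quad \text{for all } i = 1, \dots, M.$$

Taking $B = B_j$, the above implies $P(B_j \cap B_i^n) \to P(B_j \cap B_i)$ for all
$i, j \in \{1, \dots, M\}$. Since both $\{B_i^n\}$ and $\{B_i\}$ are partitions of $\mathsf{X}$, we obtain

$$P(B_i^n \,\triangle\, B_i) \to 0 \quad \text{for all } i = 1, \dots, M,$$

where $B_i^n \,\triangle\, B_i = (B_i^n \setminus B_i) \cup (B_i \setminus B_i^n)$. Then we have

$$\begin{aligned}
\|PQ_n - PQ\|_{TV}
&= \sup_{f : \|f\|_\infty \le 1} \left| \sum_{i=1}^M \int_{\mathsf{X}} f(x,i) Q_n(i|x) P(dx) - \sum_{i=1}^M \int_{\mathsf{X}} f(x,i) Q(i|x) P(dx) \right| \\
&= \sup_{f : \|f\|_\infty \le 1} \left| \sum_{i=1}^M \int_{\mathsf{X}} f(x,i) \left( 1_{\{x \in B_i^n\}} - 1_{\{x \in B_i\}} \right) P(dx) \right| \\
&\le \sup_{f : \|f\|_\infty \le 1} \sum_{i=1}^M \int_{B_i^n \triangle B_i} |f(x,i)| P(dx),
\end{aligned} \tag{5.2}$$

and, since the last expression is at most $\sum_{i=1}^M P(B_i^n \,\triangle\, B_i)$, convergence in total variation follows. □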
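
On a finite state space the quantities in (5.2) can be computed directly. The sketch below is an illustration with hypothetical names and data; it uses the convention that the total variation norm is the supremum over functions bounded by one, so that it equals the $L_1$ distance between the joint laws. It compares this distance with the total mass of the symmetric differences of the cells.

```python
import numpy as np

def tv_between_quantizers(P, labels_a, labels_b, M):
    """L1 distance between the joint laws PQ_a and PQ_b on X x {1, ..., M}."""
    K = len(P)
    Qa = np.zeros((K, M))
    Qb = np.zeros((K, M))
    Qa[np.arange(K), labels_a - 1] = 1.0   # Q_a(i|x) = 1{x in cell i of a}
    Qb[np.arange(K), labels_b - 1] = 1.0
    return float(np.sum(P[:, None] * np.abs(Qa - Qb)))

def symmetric_difference_mass(P, labels_a, labels_b, M):
    """Sum over i of P(B_i^a symmetric-difference B_i^b)."""
    return float(sum(P[(labels_a == i) != (labels_b == i)].sum()
                     for i in range(1, M + 1)))

P = np.array([0.1, 0.2, 0.3, 0.4])
labels_a = np.array([1, 1, 2, 3])    # cells of the first quantizer
labels_b = np.array([1, 2, 2, 3])    # cells of the second quantizer
tv = tv_between_quantizers(P, labels_a, labels_b, M=3)
sd = symmetric_difference_mass(P, labels_a, labels_b, M=3)
```

In this example the two numbers coincide, so the bound obtained from (5.2) is attained for deterministic quantizers on a finite space.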

We next consider quantizers with convex codecells and an input distribution that
is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^n$ [18]. Assume
$Q \in \mathcal{Q}_D(M)$ with cells $B_1, \dots, B_M$, each of which is a convex subset of $\mathbb{R}^n$. By the
separating hyperplane theorem, there exist pairs of complementary closed half spaces
$\{(H_{i,j}, H_{j,i}) : 1 \le i, j \le M,\ i \ne j\}$ such that for all $i = 1, \dots, M$,

$$B_i \subset \bigcap_{j \ne i} H_{i,j}.$$

Each $\bar{B}_i := \bigcap_{j \ne i} H_{i,j}$ is a closed convex polytope and by the absolute continuity
of $P$ one has $P(\bar{B}_i \setminus B_i) = 0$ for all $i = 1, \dots, M$. One can thus obtain a ($P$-a.s.)
representation of $Q$ by the $M(M-1)/2$ hyperplanes $h_{i,j} = H_{i,j} \cap H_{j,i}$.

Let $\mathcal{Q}_C(M)$ denote the collection of $M$-cell quantizers with convex cells and consider
a sequence $\{Q_n\}$ in $\mathcal{Q}_C(M)$. It can be shown (see the proof of Thm. 1 in [18])
that using an appropriate parametrization of the separating hyperplanes, a subsequence
$Q_{n_k}$ can be chosen which converges to a $Q \in \mathcal{Q}_C(M)$ in the sense that
$P(B_i^{n_k} \,\triangle\, B_i) \to 0$ for all $i = 1, \dots, M$, where the $B_i^{n_k}$ and the $B_i$ are the cells of $Q_{n_k}$
and $Q$, respectively. In view of (5.2), we obtain the following.

Theorem 5.8. The set $\mathcal{Q}_C(M)$ is compact under total variation at any input
measure $P$ that is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^n$.

We can now state an existence result for optimal quantization (problem P1).

Theorem 5.9. Let $P$ be absolutely continuous and suppose the goal is to find the
best quantizer $Q$ with $M$ cells minimizing $J(P,Q) = \inf_\gamma E_P^{Q,\gamma}[c(X,U)]$ under assumption
A2, where $Q$ is restricted to $\mathcal{Q}_C(M)$. Then an optimal quantizer exists.

Proof. Existence follows from Theorems 4.5 and 5.8. □

In the quantization literature finding an optimal quantizer means finding optimal
codecells and corresponding reconstruction points. Our formulation does not require
the existence of optimal reconstruction points (i.e., optimal policy $\gamma$). For cost functions
of the form $c(x,u) = \|x - u\|^p$ for $x, u \in \mathbb{R}^n$ and some $p > 0$, the cells of "good"
quantizers will be convex by the Lloyd-Max conditions of optimality; see [18] for further
results on convexity of bins for entropy constrained quantization problems. We note
that pQ also considered such cost functions for existence results on optimal quantizers; 
Graf and Luschgy [15] considered more general norm-based cost functions.

6. Multi-stage case. We consider the general case $T \in \mathbb{N}$. It should be observed
that the effect of a control policy applied at any given time-stage presents itself in two
ways: in the cost incurred at the given time-stage and in the effect on the process
distribution at future time-stages, which is known as the dual effect of control [2 

The next theorem shows the continuity of the optimal cost in the observation
channel under some regularity conditions. Note that the existence of best and worst
channels follows under an appropriate compactness condition as in Theorem 4.5 (iii).
We need the following definition.

Definition 6.1. A sequence of channels $\{Q_n\}$ converges to a channel $Q$ uniformly
in total variation if

$$\lim_{n\to\infty} \sup_{x \in \mathsf{X}} \|Q_n(\,\cdot\,|x) - Q(\,\cdot\,|x)\|_{TV} = 0.$$

Note that in the special but important case of additive observation channels,
uniform convergence in total variation is equivalent to the weaker condition that
$Q_n(\,\cdot\,|x) \to Q(\,\cdot\,|x)$ in total variation for each $x$. When the additive noise is absolutely
continuous with respect to the Lebesgue measure, uniform convergence in total
variation is equivalent to requiring that the noise density corresponding to $Q_n$ converges
in the $L_1$ sense to the density corresponding to $Q$. For example, if the noise
density is estimated from $n$ independent observations using any of the $L_1$ consistent
density estimates described in, e.g., [10], then the resulting $Q_n$ will converge (with
probability one) uniformly in total variation.
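
For additive channels the distance between $Q_n(\,\cdot\,|x)$ and $Q(\,\cdot\,|x)$ does not depend on $x$, which is why pointwise and uniform convergence coincide. The following Python sketch checks this numerically for an assumed Gaussian noise model with a Riemann-sum approximation; the noise family, the grid parameters, and the function names are choices made for this illustration only.

```python
import numpy as np

def gaussian_pdf(w, sigma):
    """Density of a zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-0.5 * (w / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def l1_distance_at(x, sigma_n, sigma, half_width=12.0, num=200_001):
    """Riemann-sum approximation of the L1 distance between Q_n(.|x) and Q(.|x)
    for the additive channel Y = x + W with Gaussian noise W."""
    y = np.linspace(x - half_width, x + half_width, num)
    dy = y[1] - y[0]
    diff = gaussian_pdf(y - x, sigma_n) - gaussian_pdf(y - x, sigma)
    return float(np.sum(np.abs(diff)) * dy)

d_at_0 = l1_distance_at(0.0, 1.5, 1.0)
d_at_37 = l1_distance_at(37.0, 1.5, 1.0)   # same value: no dependence on x
d_closer = l1_distance_at(0.0, 1.1, 1.0)   # smaller: sigma_n is closer to sigma
```

The supremum over $x$ in Definition 6.1 is therefore attained at every $x$ in this model, and it shrinks as the estimated noise density approaches the true one in $L_1$.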

Theorem 6.2. Consider the multi-stage cost function with arbitrary $T \in \mathbb{N}$. Suppose
assumption A2 holds. Then the optimization problem P1 is continuous in the observation
channel in the sense that if $\{Q_n\}$ is a sequence of channels converging to $Q$
uniformly in total variation, then

$$\lim_{n\to\infty} J(P,Q_n) = J(P,Q).$$

Proof. Let $\epsilon > 0$ and pick $\epsilon$-optimal policies $\Pi^n = \{\gamma_0^n, \gamma_1^n, \dots, \gamma_{T-1}^n\}$ and $\Pi =
\{\gamma_0, \gamma_1, \dots, \gamma_{T-1}\}$ for channels $Q_n$ and $Q$, respectively. That is, with $J(P,Q,\Pi)$ denoting
the cost of policy $\Pi$ under channel $Q$, we have $J(P,Q_n,\Pi^n) < J(P,Q_n) + \epsilon$ and
$J(P,Q,\Pi) < J(P,Q) + \epsilon$. The argument used to obtain (3.6) then gives

$$|J(P,Q) - J(P,Q_n)| \le \max\left( J(P,Q,\Pi^n) - J(P,Q_n,\Pi^n),\ J(P,Q_n,\Pi) - J(P,Q,\Pi) \right) + \epsilon. \tag{6.1}$$

We will show that both terms in the maximum converge to zero. First we consider
the term

$$J(P,Q_n,\Pi^n) - J(P,Q,\Pi^n) = \sum_{t=0}^{T-1} \left( E_P^{Q_n,\Pi^n}[c(X_t,U_t)] - E_P^{Q,\Pi^n}[c(X_t,U_t)] \right). \tag{6.2}$$

Under policy $\Pi^n = \{\gamma_0^n, \gamma_1^n, \dots, \gamma_{T-1}^n\}$, we have $U_t = \gamma_t^n(Y_{[0,t]}, U_{[0,t-1]})$. We absorb
in the notation the dependence of $U_t$ on $\gamma_0^n, \dots, \gamma_{t-1}^n$ and write $U_t = \gamma_t^n(Y_{[0,t]})$.
For $t = 0, \dots, T-1$ and $k = 0, \dots, t$ define $\zeta_{k,t}^n : \mathsf{X}^{k+1} \times \mathsf{Y}^{k+1} \to \mathbb{R}$ by setting

$$\zeta_{t,t}^n(x_{[0,t]}, y_{[0,t]}) := c(x_t, \gamma_t^n(y_{[0,t]}))$$

and defining recursively for $k = t-1, \dots, 0$

$$\zeta_{k,t}^n(x_{[0,k]}, y_{[0,k]}) := \int P(dx_{k+1} | x_k, \gamma_k^n(y_{[0,k]}))\, Q_n(dy_{k+1} | x_{k+1})\, \zeta_{k+1,t}^n(x_{[0,k+1]}, y_{[0,k+1]}).$$

Note that $\|\zeta_{t,t}^n\|_\infty \le \|c\|_\infty$ and thus $\|\zeta_{k,t}^n\|_\infty \le \|c\|_\infty$ for all $k = t-1, \dots, 0$.

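Fix $0 \le k \le t$ and consider a system such that the observation channel is $Q$ at
stages $0, \dots, k-1$ and $Q_n$ at stages $k, k+1, \dots, t$. Let $\mu_k^n$ denote the distribution of the
resulting process segment $(X_{[0,k]}, Y_{[0,k]})$ under policy $\Pi^n$ (by definition $\mu_0^n = PQ_n$).
Also under policy $\Pi^n$, let $\nu_k^n$ denote the distribution of $(X_{[0,k]}, Y_{[0,k]})$ if the observation
channel is $Q$ for all the stages $0, \dots, t$. Then we have

$$E_P^{Q_n,\Pi^n}[c(X_t,U_t)] = \int \mu_0^n(dx_0, dy_0)\, \zeta_{0,t}^n(x_0, y_0)$$

and

$$E_P^{Q,\Pi^n}[c(X_t,U_t)] = \int \nu_t^n(dx_{[0,t]}, dy_{[0,t]})\, \zeta_{t,t}^n(x_{[0,t]}, y_{[0,t]}).$$

Note that by construction, for all $k = 1, \dots, t$,

$$\int \mu_k^n(dx_{[0,k]}, dy_{[0,k]})\, \zeta_{k,t}^n(x_{[0,k]}, y_{[0,k]}) = \int \nu_{k-1}^n(dx_{[0,k-1]}, dy_{[0,k-1]})\, \zeta_{k-1,t}^n(x_{[0,k-1]}, y_{[0,k-1]}).$$

Thus each term in the sum on the right hand side of (6.2) can be expressed as a
telescopic sum, which in turn can be bounded term-by-term, as follows:

$$\begin{aligned}
\left| E_P^{Q_n,\Pi^n}[c(X_t,U_t)] - E_P^{Q,\Pi^n}[c(X_t,U_t)] \right|
&= \left| \sum_{k=0}^{t} \left( \int \mu_k^n(dx_{[0,k]}, dy_{[0,k]})\, \zeta_{k,t}^n(x_{[0,k]}, y_{[0,k]}) - \int \nu_k^n(dx_{[0,k]}, dy_{[0,k]})\, \zeta_{k,t}^n(x_{[0,k]}, y_{[0,k]}) \right) \right| \\
&\le \sum_{k=0}^{t} \|\zeta_{k,t}^n\|_\infty \|\mu_k^n - \nu_k^n\|_{TV} \\
&\le \|c\|_\infty \sum_{k=0}^{t} \|\mu_k^n - \nu_k^n\|_{TV}.
\end{aligned} \tag{6.3}$$

For any Borel set $B \subset \mathsf{X}^{k+1} \times \mathsf{Y}^{k+1}$, define $B(x_{[0,k]}, y_{[0,k-1]}) = \{y_k \in \mathsf{Y} : (x_{[0,k]}, y_{[0,k]}) \in
B\}$, so that

$$\begin{aligned}
|\mu_k^n(B) - \nu_k^n(B)|
&= \left| E\left[ Q_n\big(B(X_{[0,k]}, Y_{[0,k-1]}) \,\big|\, X_k\big) - Q\big(B(X_{[0,k]}, Y_{[0,k-1]}) \,\big|\, X_k\big) \right] \right| \\
&\le \sup_{x_k \in \mathsf{X}} \|Q_n(\,\cdot\,|x_k) - Q(\,\cdot\,|x_k)\|_{TV},
\end{aligned}$$

where the expectation is taken with respect to the distribution of $(X_{[0,k]}, Y_{[0,k-1]})$, which is
the same for both systems since both use the channel $Q$ at stages $0, \dots, k-1$.

The preceding bound and the uniform convergence of $\{Q_n\}$ imply $\lim_{n\to\infty} \|\mu_k^n - \nu_k^n\|_{TV} = 0$
for all $k$. Combining this with (6.3) and (6.2) gives

$$J(P,Q_n,\Pi^n) - J(P,Q,\Pi^n) \to 0.$$

Replacing $\Pi^n$ with $\Pi$ we can use an identical argument to show that $J(P,Q_n,\Pi) \to
J(P,Q,\Pi)$. Since $\epsilon > 0$ in (6.1) was arbitrary, the proof is complete. □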

We obtained the continuity of the optimal cost on the space of channels equipped 
with a more stringent notion for convergence in total variation. This result and its 
proof indicate that further technical complications emerge in multi-stage problems. 
Likewise, upper semi-continuity under weak convergence and setwise convergence
requires more stringent uniformity assumptions, which we leave for future research.

One further interesting problem regarding the multi-stage case is to consider 
adaptive observation channels. For example, one may aim to design optimal adaptive 
quantizers for a control problem. In this case, Markov Decision Process tools can be 
used for obtaining existence conditions for optimal channels and quantizers. Some 
related results on optimal adaptive quantization are presented in [7 . 

7. Concluding remarks, some implications and future work. This paper 
studied the structural and topological properties of some optimization problems in 
stochastic control in the space of observation channels. The main problem we con- 
sidered is how to approach appropriate notions of convergence and distance while 
studying communication channels in the context of stochastic control problems. 

The restriction to Euclidean state spaces is not essential and many (but not all) 
of the positive results can be extended to the case where X, Y, and U are arbitrary 
Polish spaces. In particular, all the positive results in Section 3 carry through without
change, except Theorem 3.2. The results of Section 4 hold for this more general setup
(however, in Lemma 4.3 we need the additional condition that the space is $\sigma$-compact).
Likewise, most of the positive results in Section 5 on quantization hold more generally
(in fact, Theorem 5.5 holds for an arbitrary measurable space), but two of the main
results, Theorems 5.8 and 5.9, do need the assumption that $\mathsf{X}$ is a finite-dimensional
Euclidean space. 

7.1. Sufficient conditions for continuity under setwise and weak convergence.
A careful analysis of the proof of Theorem 3.4 reveals that we need a
uniform convergence principle for setwise convergence to be sufficient for continuity.
That is, we wish to have

$$\lim_{n\to\infty} \sup_{\gamma \in \Gamma} \left| \int_{\mathsf{X}} \left( \int_{\mathsf{Y}} Q(dy|x)\, c(x,\gamma(y)) - \int_{\mathsf{Y}} Q_n(dy|x)\, c(x,\gamma(y)) \right) P(dx) \right| = 0, \tag{7.1}$$

where $\Gamma$ is a set of allowable policies, to be able to have continuity under setwise
convergence. Thus, one important question of practical interest is the following: What
type of stochastic control problems, cost functions, and allowable policies lead to
solutions which admit such a uniform convergence principle under setwise convergence?
Some sufficient conditions for uniform setwise convergence are presented in [30].

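Likewise, a parallel discussion applies for weak convergence under the assumption
that for every $Q_n$ and for $Q$, corresponding optimal policies $\gamma_n$ and $\gamma$ are continuous
and are assumed to be from a given class of policies $\Gamma$. One wants to have

$$\int_{\mathsf{X} \times \mathsf{Y}} c(x, \gamma_n(y))\, Q_n(dy|x) P(dx) \to \int_{\mathsf{X} \times \mathsf{Y}} c(x, \gamma(y))\, Q(dy|x) P(dx).$$

A sufficient condition for this is the following form of uniform weak convergence:

$$\lim_{n\to\infty} \sup_{\gamma \in \Gamma} \left| \int_{\mathsf{X} \times \mathsf{Y}} c(x, \gamma(y))\, Q_n(dy|x) P(dx) - \int_{\mathsf{X} \times \mathsf{Y}} c(x, \gamma(y))\, Q(dy|x) P(dx) \right| = 0.$$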



7.2. Empirical consistency of optimal controllers. One issue to discuss is 
the connections of our results with consistency in learning the channel from empirical 
observations. 

When one does not know the system dynamics, such as the observation channel,
one typically attempts to learn the channel via test inputs or empirical observations.
Let $\{(X_i, Y_i),\ i \in \mathbb{N}\}$ be an $\mathsf{X} \times \mathsf{Y}$-valued i.i.d. sequence generated according to some
distribution $\mu$. Define the empirical occupation measures for every $n \in \mathbb{N}$ by
letting

$$\mu_n(B) = \frac{1}{n} \sum_{i=1}^n 1_{\{(X_i, Y_i) \in B\}}$$

for every measurable $B \subset \mathsf{X} \times \mathsf{Y}$. Then for each fixed $B$ one has $\mu_n(B) \to \mu(B)$ almost surely (a.s.)
by the strong law of large numbers. However, it is generally not true that $\mu_n \to \mu$
setwise a.s. (e.g., $\mu_n$ never converges to $\mu$ setwise when either $X_i$ or $Y_i$ has a nonatomic
distribution), in which case $\mu_n$ cannot converge to $\mu$ in total variation.

On the other hand, again by the strong law, for any $\mu$-integrable function $f$ on
$\mathsf{X} \times \mathsf{Y}$, one has, almost surely,

$$\lim_{n\to\infty} \int f(x,y)\, \mu_n(dx, dy) = \int f(x,y)\, \mu(dx, dy).$$

In particular, $\mu_n \to \mu$ weakly with probability one [13].
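
The contrast between the two modes of convergence can be seen in a short simulation. The sketch below assumes, purely for illustration, a one-dimensional uniform distribution in place of the joint distribution of state and observation, a fixed random seed, and the test function $f(x) = x^2$; the function name `empirical_integral` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)                   # fixed seed, for reproducibility only
samples = rng.uniform(0.0, 1.0, size=100_000)    # i.i.d. draws from mu = Uniform[0, 1]

def empirical_integral(f, samples):
    """Integral of f against the empirical measure mu_n of the samples."""
    return float(np.mean(f(samples)))

estimate = empirical_integral(lambda x: x ** 2, samples)   # true value is 1/3

# mu_n puts all of its mass on the finite set of sample points, which has
# mu-measure zero, so sup_B |mu_n(B) - mu(B)| = 1 for every n.
mass_on_atoms = float(np.mean(np.isin(samples, samples)))  # equals 1.0
```

Integrals of a fixed bounded function settle near their true value, while the set of observed points always separates the empirical measure from a nonatomic one.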

In the learning theoretic context, the convergence of the costs optimal for $\mu_n$ to
the cost optimal for $\mu$ is called the consistency of empirical risk minimization (see [32]
for an overview). In particular, if the cost function and the allowable control policies
$\Gamma$ are such that

$$\lim_{n\to\infty} \sup_{\gamma \in \Gamma} \left| \int c(x, \gamma(y))\, \mu_n(dx, dy) - \int c(x, \gamma(y))\, \mu(dx, dy) \right| = 0,$$

then we obtain consistency.

A class of measurable functions $\mathcal{L}$ is called a Glivenko-Cantelli class [12] if the
integrals with respect to the empirical measures converge almost surely to the integrals
with respect to the true measure uniformly over $\mathcal{L}$. Thus, if

$$\Gamma = \{\gamma : c(x, \gamma(y)) \in \mathcal{L}\},$$

where $\mathcal{L}$ is a Glivenko-Cantelli family of functions, then we could establish
consistency. One example of a Glivenko-Cantelli family of real functions on $\mathbb{R}^N$ is the
family $\{f : \|f\|_{BL} \le M\}$ for some $0 < M < \infty$, where $\|\cdot\|_{BL}$ denotes the bounded
Lipschitz norm [12].

Thus, for a given cost function, restricting the class of control policies allows us to
obtain consistency and robustness to mismatch in the channel due to learning. The
classification of the class of objective functions and policies which would lead to such
a consistency result is a future research problem.

8. Appendix. 

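8.1. Proof of Lemma 2.2. (i) Since $c(x, \,\cdot\,)$ is continuous and bounded on $\mathsf{Y}$
for all $x$, we have

$$\begin{aligned}
\lim_{n\to\infty} \int_{\mathsf{X} \times \mathsf{Y}} c(x,y)\, PQ_n(dx\, dy)
&= \lim_{n\to\infty} \int_{\mathsf{X}} \left( \int_{\mathsf{Y}} c(x,y)\, Q_n(dy|x) \right) P(dx) \\
&= \int_{\mathsf{X}} \left( \int_{\mathsf{Y}} c(x,y)\, Q(dy|x) \right) P(dx) \\
&= \int_{\mathsf{X} \times \mathsf{Y}} c(x,y)\, PQ(dx\, dy),
\end{aligned}$$

where first we used Fubini's theorem, and then the dominated convergence theorem
and the fact that $\int_{\mathsf{Y}} c(x,y)\, Q_n(dy|x)$ is bounded and converges to $\int_{\mathsf{Y}} c(x,y)\, Q(dy|x)$
for $P$-a.e. $x$.

(ii) Let $A \in \mathcal{B}(\mathsf{X} \times \mathsf{Y})$ and for $x \in \mathsf{X}$ let $A_x = \{y : (x,y) \in A\}$. Similarly to the previous
proof,

$$\lim_{n\to\infty} PQ_n(A) = \lim_{n\to\infty} \int_{\mathsf{X}} Q_n(A_x|x) P(dx) = \int_{\mathsf{X}} Q(A_x|x) P(dx) = PQ(A)$$

by the dominated convergence theorem since $\lim_{n\to\infty} Q_n(A_x|x) = Q(A_x|x)$ for $P$-a.e.
$x$.

(iii) We have

$$\begin{aligned}
\sup_{A \in \mathcal{B}(\mathsf{X} \times \mathsf{Y})} |PQ_n(A) - PQ(A)|
&= \sup_{A \in \mathcal{B}(\mathsf{X} \times \mathsf{Y})} \left| \int_{\mathsf{X}} Q_n(A_x|x) P(dx) - \int_{\mathsf{X}} Q(A_x|x) P(dx) \right| \\
&\le \sup_{A \in \mathcal{B}(\mathsf{X} \times \mathsf{Y})} \int_{\mathsf{X}} |Q_n(A_x|x) - Q(A_x|x)|\, P(dx) \\
&\le \int_{\mathsf{X}} \sup_{B \in \mathcal{B}(\mathsf{Y})} |Q_n(B|x) - Q(B|x)|\, P(dx).
\end{aligned}$$

Since $\sup_{B \in \mathcal{B}(\mathsf{Y})} |Q_n(B|x) - Q(B|x)| \to 0$ for $P$-a.e. $x$, an application of the dominated
convergence theorem completes the proof. □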



8.2. Proof of Theorem 3.1. We have

$$J(P,Q) = \inf_{\gamma} \int_{\mathsf{X} \times \mathsf{Y}} c(x, \gamma(y))\, Q(dy|x) P(dx).$$

Let $(X,Y) \sim PQ$ and let $P(\,\cdot\,|y)$ be the (regular) conditional distribution of $X$ given
$Y = y$. If $(PQ)_Y$ denotes the distribution of $Y$, then

$$\begin{aligned}
J(P,Q) &= \inf_{\gamma} \int_{\mathsf{Y}} \int_{\mathsf{X}} c(x, \gamma(y))\, P(dx|y)\, (PQ)_Y(dy) \\
&= \int_{\mathsf{Y}} \left( \inf_{u \in \mathsf{U}} \int_{\mathsf{X}} c(x,u)\, P(dx|y) \right) (PQ)_Y(dy),
\end{aligned}$$

where the validity of the second equality is explained below.

By assumption A3, $c$ is bounded and $c(x,u_n) \to c(x,u)$ if $u_n \to u$ for all $x$; thus
by the dominated convergence theorem

$$\int_{\mathsf{X}} c(x, u_n)\, P(dx|y) \to \int_{\mathsf{X}} c(x,u)\, P(dx|y),$$

proving that $g(u,y) = \int_{\mathsf{X}} c(x,u)\, P(dx|y)$ is continuous in $u$ for each $y$. Since $\mathsf{U}$ is
compact, there exists $\gamma^*(y) \in \mathsf{U}$ such that $g(\gamma^*(y), y) = \inf_{u \in \mathsf{U}} g(u,y)$. A standard
argument shows that $\gamma^* : \mathsf{Y} \to \mathsf{U}$ can be taken to be measurable (see, e.g., Appendix D
of [H]) and we have

$$J(P,Q) = \int_{\mathsf{X} \times \mathsf{Y}} c(x, \gamma^*(y))\, Q(dy|x) P(dx). \qquad □$$